Lead Scoring Case Study¶
- Supervised Learning Algorithm - Logistic Regression [ Classification ]
- Programming Language - Python
- Developed by UV, VK, VY
Initialization - Preprocessing - Visualization¶
1. Package Imports and Data Initialization¶
%pip install fast_ml ## Required for constant feature identification package
from fast_ml.utilities import display_all
from fast_ml.feature_selection import get_constant_features
Requirement already satisfied: fast_ml in /home/vk/anaconda3/envs/dsbasic/lib/python3.10/site-packages (3.68) Note: you may need to restart the kernel to use updated packages.
# Import core libraries: numpy, pandas, matplotlib, seaborn
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns; sns.set_theme(color_codes=True)
import warnings
warnings.filterwarnings('ignore')
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
%matplotlib inline
# Set custom display properties in pandas
pd.set_option("display.max_rows", 900)
pd.set_option("display.max_columns", 900)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# Import the necessary machine learning packages (sklearn, statsmodels) for performing logistic regression
import statsmodels.api as sm
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.metrics import accuracy_score, recall_score,precision_score, roc_auc_score, confusion_matrix, f1_score, roc_curve, precision_recall_curve
# Import the dataset from Leads.csv
lead_score_df = pd.read_csv('Leads.csv')
lead_score_df.head(1)
| Prospect ID | Lead Number | Lead Origin | Lead Source | Do Not Email | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | Last Activity | Country | Specialization | How did you hear about X Education | What is your current occupation | What matters most to you in choosing a course | Search | Magazine | Newspaper Article | X Education Forums | Newspaper | Digital Advertisement | Through Recommendations | Receive More Updates About Our Courses | Tags | Lead Quality | Update me on Supply Chain Content | Get updates on DM Content | Lead Profile | City | Asymmetrique Activity Index | Asymmetrique Profile Index | Asymmetrique Activity Score | Asymmetrique Profile Score | I agree to pay the amount through cheque | A free copy of Mastering The Interview | Last Notable Activity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | 660737 | API | Olark Chat | No | No | 0 | 0.000 | 0 | 0.000 | Page Visited on Website | NaN | Select | Select | Unemployed | Better Career Prospects | No | No | No | No | No | No | No | No | Interested in other courses | Low in Relevance | No | No | Select | Select | 02.Medium | 02.Medium | 15.000 | 15.000 | No | No | Modified |
lead_score_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype
---  ------                                         --------------  -----
 0   Prospect ID                                    9240 non-null   object
 1   Lead Number                                    9240 non-null   int64
 2   Lead Origin                                    9240 non-null   object
 3   Lead Source                                    9204 non-null   object
 4   Do Not Email                                   9240 non-null   object
 5   Do Not Call                                    9240 non-null   object
 6   Converted                                      9240 non-null   int64
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64
 9   Page Views Per Visit                           9103 non-null   float64
 10  Last Activity                                  9137 non-null   object
 11  Country                                        6779 non-null   object
 12  Specialization                                 7802 non-null   object
 13  How did you hear about X Education             7033 non-null   object
 14  What is your current occupation                6550 non-null   object
 15  What matters most to you in choosing a course  6531 non-null   object
 16  Search                                         9240 non-null   object
 17  Magazine                                       9240 non-null   object
 18  Newspaper Article                              9240 non-null   object
 19  X Education Forums                             9240 non-null   object
 20  Newspaper                                      9240 non-null   object
 21  Digital Advertisement                          9240 non-null   object
 22  Through Recommendations                        9240 non-null   object
 23  Receive More Updates About Our Courses         9240 non-null   object
 24  Tags                                           5887 non-null   object
 25  Lead Quality                                   4473 non-null   object
 26  Update me on Supply Chain Content              9240 non-null   object
 27  Get updates on DM Content                      9240 non-null   object
 28  Lead Profile                                   6531 non-null   object
 29  City                                           7820 non-null   object
 30  Asymmetrique Activity Index                    5022 non-null   object
 31  Asymmetrique Profile Index                     5022 non-null   object
 32  Asymmetrique Activity Score                    5022 non-null   float64
 33  Asymmetrique Profile Score                     5022 non-null   float64
 34  I agree to pay the amount through cheque       9240 non-null   object
 35  A free copy of Mastering The Interview         9240 non-null   object
 36  Last Notable Activity                          9240 non-null   object
dtypes: float64(4), int64(3), object(30)
memory usage: 2.6+ MB
----------------------------------------------------------------------¶
2. Data Preprocessing - Part 1¶
Custom Functions for Preprocessing and EDA¶
# Classify columns as either categorical/discrete or continuous, and return the grouping as a dict
def classify_feature_dtype(df, cols):
d_categories = {'int_cat': [], "float_ts":[] }
for col in cols:
if (len(df[col].unique()) < 20) and (df[col].dtype != np.float64):
d_categories['int_cat'].append(col)
else:
if not isinstance(df[col][df[col].notna()].unique()[0], str):
d_categories['float_ts'].append(col)
else:
d_categories['int_cat'].append(col)
return d_categories
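The bucketing rule above can be sketched on a toy frame (hypothetical data; the string-handling branch is omitted for brevity): low-cardinality non-float columns land in the categorical bucket, everything else in the continuous one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the notebook's mix of dtypes.
toy = pd.DataFrame({
    "converted": [0, 1, 0, 1],                   # int64, < 20 uniques -> categorical bucket
    "ttime_on_site": [0.0, 12.5, 248.0, 936.0],  # float64 -> continuous bucket
})

buckets = {"int_cat": [], "float_ts": []}
for col in toy.columns:
    # Same test as classify_feature_dtype: few unique values and not float64
    if toy[col].nunique() < 20 and toy[col].dtype != np.float64:
        buckets["int_cat"].append(col)
    else:
        buckets["float_ts"].append(col)

print(buckets)  # converted -> int_cat, ttime_on_site -> float_ts
```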
# Print all statistical information for a given set of columns
def show_stats(df, cols):
for col in list(cols):
print("Total Nulls: {0},\nMode: {1}".format(df[col].isna().sum(), df[col].mode()[0]))
if len(df[col].unique()) < 50:
print("\nUnique: {0}\n".format(df[col].unique()))
if (df[col].dtype == int) or (df[col].dtype == float):
print("Median : {0}, \nVariance: {1}, \n\nDescribe: {2} \n".format(df[col].median(), df[col].var(), df[col].describe()))
print("ValueCounts: {0} \n\n\n".format((df[col].value_counts(normalize=True) * 100).head(5)))
print("------------------------------------------------------------------")
# Return the percentage of null values in each column of a dataframe
def check_cols_null_pct(df):
df_non_na = df.count() / len(df) # Ratio of non null values
df_na_pct = (1 - df_non_na) * 100 # Find the Percentage of null values
return df_na_pct.sort_values(ascending=False) # Sort the resulting values in descending order
# Generate univariate-analysis charts based on the data type of the given columns.
# Args: dataframe, columns, target column name, feature type, and an optional legend dict.
def univariate_plots(df, cols, target=None, ftype=None, l_dict = None):
for col in cols:
# Generate plots for categorical features: pie chart, count plot, and a bar plot against the target
if ftype == "categorical":
fig, axs = plt.subplots(1, 3, figsize=(20, 6))
col_idx = 0
axs[col_idx].pie(x=df[col].value_counts().head(12), labels=df[col].value_counts().head(12).index.str[:10], autopct="%1.1f%%",
radius=1, textprops={"fontsize": 10, "color": "Black"}, startangle=90, rotatelabels=False, )
axs[col_idx].set_title("PieChart of {0}".format(col), y=1); plt.xticks(rotation=45); plt.ylabel("Percentage")
fig.subplots_adjust(wspace=0.5, hspace=0.3)
col_idx += 1
sns.countplot(data=df, y=col, order=df[col].value_counts().head(15).index, palette="viridis", ax=axs[col_idx])
if (l_dict is not None) and (l_dict.get(col) is not None):
axs[col_idx].legend([ f'{k} - {v}' for k,v in l_dict[col].items()])
axs[col_idx].set_title("Countplot of {0}".format(col)); plt.xticks(rotation=45); plt.xlabel(col); plt.ylabel("Count")
fig.subplots_adjust(wspace=0.5, hspace=0.3)
col_idx += 1
ax = sns.barplot(data=df, x=df[col].str[:10], y=target, order=df[col].value_counts().index.str[:10], palette="viridis", ax=axs[col_idx], errwidth=0)
for i in ax.containers:
ax.bar_label(i,)
axs[col_idx].set_title('Barplot against target'); plt.xticks(rotation=90); plt.xlabel(col)
fig.subplots_adjust(wspace=0.5, hspace=0.3)
plt.suptitle("Univariate analysis of {0}".format(col), fontsize=12, y=0.95)
plt.tight_layout()
plt.subplots_adjust(top=0.85)
plt.show();
plt.clf()
# Generate plots for numerical features: box plot, histogram, KDE plot, scatter plot
elif ftype == "non_categorical":
fig, axs = plt.subplots(1, 4, figsize=(20, 6))
col_idx = 0
sns.boxplot(data=df, y=col, palette="viridis", flierprops=dict(marker="o", markersize=6, markerfacecolor="red", markeredgecolor="black"),
medianprops=dict(linestyle="-", linewidth=3, color="#FF9900"), whiskerprops=dict(linestyle="-", linewidth=2, color="black"),
capprops=dict(linestyle="-", linewidth=2, color="black"), ax=axs[col_idx])
axs[col_idx].set_title("Boxplot of {0}".format(col)); plt.xticks(rotation=45); plt.xlabel(col)
fig.subplots_adjust(wspace=0.5, hspace=0.3)
col_idx += 1
axs[col_idx].hist(data=df, x=col, label=col)
axs[col_idx].set_title("Histogram of {0}".format(col)); plt.xticks(rotation=45); plt.xlabel(col)
fig.subplots_adjust(wspace=0.5, hspace=0.3)
col_idx += 1
sns.kdeplot(df[col], fill=True, ax=axs[col_idx])  # 'fill' replaces seaborn's deprecated 'shade' argument
axs[col_idx].set_title("KDE plot of {0}".format(col)); plt.xticks(rotation=45); plt.xlabel(col)
fig.subplots_adjust(wspace=0.5, hspace=0.3)
col_idx += 1
sns.scatterplot(x=df.index, y=df[col], ax=axs[col_idx])  # plot values against the row index explicitly
axs[col_idx].set_title("Scatterplot of {0}".format(col)); plt.xticks(rotation=45); plt.xlabel(col)
fig.subplots_adjust(wspace=0.5, hspace=0.3)
plt.suptitle("Univariate analysis of {0}".format(col), fontsize=12, y=0.95)
plt.tight_layout()
plt.subplots_adjust(top=0.85)
plt.show()
plt.clf()
# Perform outlier analysis on the given dataframe: compute the lower threshold,
# upper threshold and IQR for each column, and return the result as a dataframe.
# find_outlier=True restricts the output to the outlier rows of each column;
# find_outlier=False returns the thresholds for all columns.
def get_extremeval_threshld(df, find_outlier=False):
outlier_df = pd.DataFrame(columns=[i for i in df.columns if find_outlier == True], data=None)
for col in df.columns:
thirdq, firstq = df[col].quantile(0.75), df[col].quantile(0.25)
iqr = 1.5 * (thirdq - firstq)
extvalhigh, extvallow = iqr + thirdq, firstq - iqr
if find_outlier == True:
dfout = df.loc[(df[col] > extvalhigh) | (df[col] < extvallow)]
dfout = dfout.assign(name=col, thresh_low=extvallow, thresh_high=extvalhigh)
else:
dfout = pd.DataFrame([[col, extvallow, extvalhigh]], columns=['name', 'thresh_low', 'thresh_high'])
outlier_df = pd.concat([outlier_df, dfout])
# outlier_df = outlier_df.reset_index(drop=True)
outlier_df = outlier_df.set_index('name',drop=True)
return outlier_df
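The IQR rule applied by the function can be sketched on a small hypothetical series: the thresholds are Q1 - 1.5*IQR and Q3 + 1.5*IQR, and anything outside them counts as an outlier.

```python
import pandas as pd

# Hypothetical data with one obvious outlier.
s = pd.Series([1, 2, 3, 4, 5, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)   # 2.25 and 4.75
iqr = q3 - q1                                  # 2.5
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # -1.5 and 8.5

outliers = s[(s < low) | (s > high)]
print(low, high, outliers.tolist())            # only 100 lies outside the fences
```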
Data Cleaning, Column Renaming, Missing Value Imputation, Feature Selection¶
# duplicate-row validation using the id columns
print(f"{lead_score_df.index.is_unique}, {lead_score_df.columns.is_unique}, {lead_score_df['Prospect ID'].is_unique}, {lead_score_df['Lead Number'].is_unique}")
True, True, True, True
# drop unnecessary columns
lead_score_df = lead_score_df.drop(columns=['Prospect ID','Lead Number', 'I agree to pay the amount through cheque', 'Last Notable Activity'])
# rename columns that are too long
lead_score_df = lead_score_df.rename(columns={'Total Time Spent on Website':'ttime_on_site', 'Page Views Per Visit':'pg_view_pv', 'How did you hear about X Education':'info_abt_X_Edu', 'What is your current occupation':'curr_occupation',
'What matters most to you in choosing a course':'reason_behind_course', 'Receive More Updates About Our Courses':'more_course_updates', 'Update me on Supply Chain Content':'supply_chain_info', 'Get updates on DM Content':'get_dm',
'Asymmetrique Activity Index':'asym_activ_idx', 'Asymmetrique Profile Index':'asym_prof_idx', 'Asymmetrique Activity Score':'asym_activ_score', 'Asymmetrique Profile Score':'asym_prof_score',
'A free copy of Mastering The Interview':'avail_free_copy'})
# replace spaces in column names with underscores and convert them to lower case
lead_score_df.columns = lead_score_df.columns.str.replace(pat=' ',repl='_', regex=True)
lead_score_df.columns = lead_score_df.columns.str.lower()
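The column-name cleanup can be illustrated on a few of the original headers (a minimal sketch on a hypothetical subset):

```python
import pandas as pd

# Hypothetical subset of the original column headers.
cols = pd.Index(["Lead Origin", "Do Not Email", "TotalVisits"])

# Replace spaces with underscores, then lower-case, as done above.
cols = cols.str.replace(" ", "_", regex=False).str.lower()
print(cols.tolist())  # ['lead_origin', 'do_not_email', 'totalvisits']
```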
# Check the shape and size of the data frame
lead_score_df.head(1)
lead_score_df.dtypes
| lead_origin | lead_source | do_not_email | do_not_call | converted | totalvisits | ttime_on_site | pg_view_pv | last_activity | country | specialization | info_abt_x_edu | curr_occupation | reason_behind_course | search | magazine | newspaper_article | x_education_forums | newspaper | digital_advertisement | through_recommendations | more_course_updates | tags | lead_quality | supply_chain_info | get_dm | lead_profile | city | asym_activ_idx | asym_prof_idx | asym_activ_score | asym_prof_score | avail_free_copy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | API | Olark Chat | No | No | 0 | 0.000 | 0 | 0.000 | Page Visited on Website | NaN | Select | Select | Unemployed | Better Career Prospects | No | No | No | No | No | No | No | No | Interested in other courses | Low in Relevance | No | No | Select | Select | 02.Medium | 02.Medium | 15.000 | 15.000 | No |
lead_origin object lead_source object do_not_email object do_not_call object converted int64 totalvisits float64 ttime_on_site int64 pg_view_pv float64 last_activity object country object specialization object info_abt_x_edu object curr_occupation object reason_behind_course object search object magazine object newspaper_article object x_education_forums object newspaper object digital_advertisement object through_recommendations object more_course_updates object tags object lead_quality object supply_chain_info object get_dm object lead_profile object city object asym_activ_idx object asym_prof_idx object asym_activ_score float64 asym_prof_score float64 avail_free_copy object dtype: object
print(f'{lead_score_df.shape}, {lead_score_df.size}')
(9240, 33), 304920
# check null value percentage
# After checking the null percentage for all the features,
# we can see that many features have more than 45% null values
null_pct = check_cols_null_pct(lead_score_df)
null_pct[null_pct > 0]
lead_quality            51.591
asym_prof_score         45.649
asym_activ_score        45.649
asym_prof_idx           45.649
asym_activ_idx          45.649
tags                    36.288
lead_profile            29.318
reason_behind_course    29.318
curr_occupation         29.113
country                 26.634
info_abt_x_edu          23.885
specialization          15.563
city                    15.368
pg_view_pv               1.483
totalvisits              1.483
last_activity            1.115
lead_source              0.390
dtype: float64
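A minimal sketch of the null-percentage calculation used by check_cols_null_pct, on toy data: the ratio of non-null counts to the frame length, inverted and scaled to a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 'a' is half null, 'b' is fully populated.
df = pd.DataFrame({"a": [1, np.nan, np.nan, 4], "b": [1, 2, 3, 4]})

na_pct = (1 - df.count() / len(df)) * 100  # percent of nulls per column
print(na_pct.sort_values(ascending=False))  # a -> 50.0, b -> 0.0
```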
show_stats(lead_score_df,['tags','specialization'])
Total Nulls: 3353, Mode: Will revert after reading the email Unique: ['Interested in other courses' 'Ringing' 'Will revert after reading the email' nan 'Lost to EINS' 'In confusion whether part time or DLP' 'Busy' 'switched off' 'in touch with EINS' 'Already a student' 'Diploma holder (Not Eligible)' 'Graduation in progress' 'Closed by Horizzon' 'number not provided' 'opp hangup' 'Not doing further education' 'invalid number' 'wrong number given' 'Interested in full time MBA' 'Still Thinking' 'Lost to Others' 'Shall take in the next coming month' 'Lateral student' 'Interested in Next batch' 'Recognition issue (DEC approval)' 'Want to take admission but has financial problems' 'University not recognized'] ValueCounts: tags Will revert after reading the email 35.196 Ringing 20.435 Interested in other courses 8.714 Already a student 7.899 Closed by Horizzon 6.081 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 1438, Mode: Select Unique: ['Select' 'Business Administration' 'Media and Advertising' nan 'Supply Chain Management' 'IT Projects Management' 'Finance Management' 'Travel and Tourism' 'Human Resource Management' 'Marketing Management' 'Banking, Investment And Insurance' 'International Business' 'E-COMMERCE' 'Operations Management' 'Retail Management' 'Services Excellence' 'Hospitality Management' 'Rural and Agribusiness' 'Healthcare Management' 'E-Business'] ValueCounts: specialization Select 24.891 Finance Management 12.510 Human Resource Management 10.869 Marketing Management 10.741 Operations Management 6.447 Name: proportion, dtype: float64 ------------------------------------------------------------------
# In 'tags' and 'specialization', replace the 'Select' placeholder and NaN with 'unknown';
# in all other columns, replace 'Select' with NaN
lead_score_df[['tags','specialization']] = lead_score_df[['tags','specialization']].replace(to_replace=['select','Select',np.nan], value='unknown')
lead_score_df = lead_score_df.replace(to_replace=['select','Select'], value=np.nan)
# validate that the 'Select' placeholder has been replaced everywhere
[i for i in lead_score_df.columns if 'select' in (lead_score_df[i].astype(str).str.lower()).str.findall('select').value_counts().index.map(''.join).to_list()]
[]
Constant Feature Identification¶
# Check for constant features that contain only a single value.
# The dataset has several features dominated by a single category;
# such (quasi-)constant features carry little information for the model, so we drop them
constant_features = get_constant_features(lead_score_df)
constant_features.head(10)
"','".join(constant_features['Var'].to_list())
| Desc | Var | Value | Perc | |
|---|---|---|---|---|
| 0 | Constant | magazine | No | 100.000 |
| 1 | Constant | more_course_updates | No | 100.000 |
| 2 | Constant | supply_chain_info | No | 100.000 |
| 3 | Constant | get_dm | No | 100.000 |
| 4 | Quasi Constant | x_education_forums | No | 99.989 |
| 5 | Quasi Constant | newspaper | No | 99.989 |
| 6 | Quasi Constant | do_not_call | No | 99.978 |
| 7 | Quasi Constant | newspaper_article | No | 99.978 |
| 8 | Quasi Constant | digital_advertisement | No | 99.957 |
| 9 | Quasi Constant | through_recommendations | No | 99.924 |
"magazine','more_course_updates','supply_chain_info','get_dm','x_education_forums','newspaper','do_not_call','newspaper_article','digital_advertisement','through_recommendations','search"
lead_score_df['reason_behind_course'].value_counts(normalize=True)*100
reason_behind_course
Better Career Prospects      99.954
Flexibility & Convenience     0.031
Other                         0.015
Name: proportion, dtype: float64
# drop all the constant and quasi-constant features (plus the near-constant reason_behind_course)
lead_score_df = lead_score_df.drop(['magazine', 'more_course_updates', 'supply_chain_info', 'get_dm', 'x_education_forums',
'newspaper', 'do_not_call', 'newspaper_article', 'digital_advertisement', 'through_recommendations', 'search','reason_behind_course'], axis=1)
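For reference, the same (quasi-)constant screen can be sketched in plain pandas without fast_ml. The data below is hypothetical and the 99% dominance threshold is an assumption for illustration, not fast_ml's internal cutoff.

```python
import pandas as pd

# Hypothetical frame: one constant, one quasi-constant, one informative column.
df = pd.DataFrame({
    "magazine": ["No"] * 100,              # constant (100% one value)
    "do_not_call": ["No"] * 99 + ["Yes"],  # quasi-constant (99% one value)
    "city": ["Mumbai", "Thane"] * 50,      # informative (50/50 split)
})

# Share of the most frequent value in each column.
dominance = df.apply(lambda s: s.value_counts(normalize=True).iloc[0])

# Flag columns whose top value covers >= 99% of rows (assumed threshold).
to_drop = dominance[dominance >= 0.99].index.tolist()
print(to_drop)  # ['magazine', 'do_not_call']
```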
# convert dtypes
# Convert the data type of certain features from object to category type
obj_cols = lead_score_df.select_dtypes(include='object').columns
lead_score_df[obj_cols] = lead_score_df[obj_cols].astype(dtype='category')
# check null value percentage
# Since 'Select' was replaced with NaN in certain columns, the null percentage has increased for those features.
# Therefore we drop all the features that have more than 40% null values
null_pct = check_cols_null_pct(lead_score_df)
null_pct[null_pct > 0]
lead_score_df = lead_score_df.drop(null_pct[null_pct > 40].index, axis=1)
info_abt_x_edu      78.463
lead_profile        74.188
lead_quality        51.591
asym_prof_score     45.649
asym_activ_score    45.649
asym_prof_idx       45.649
asym_activ_idx      45.649
city                39.708
curr_occupation     29.113
country             26.634
totalvisits          1.483
pg_view_pv           1.483
last_activity        1.115
lead_source          0.390
dtype: float64
null_pct = check_cols_null_pct(lead_score_df)
null_pct[null_pct > 0]
na_cols = null_pct[null_pct > 0].index
na_cols
city               39.708
curr_occupation    29.113
country            26.634
totalvisits         1.483
pg_view_pv          1.483
last_activity       1.115
lead_source         0.390
dtype: float64
Index(['city', 'curr_occupation', 'country', 'totalvisits', 'pg_view_pv',
'last_activity', 'lead_source'],
dtype='object')
lead_score_df.describe(include=np.number)
lead_score_df.describe(exclude=np.number)
| converted | totalvisits | ttime_on_site | pg_view_pv | |
|---|---|---|---|---|
| count | 9240.000 | 9103.000 | 9240.000 | 9103.000 |
| mean | 0.385 | 3.445 | 487.698 | 2.363 |
| std | 0.487 | 4.855 | 548.021 | 2.161 |
| min | 0.000 | 0.000 | 0.000 | 0.000 |
| 25% | 0.000 | 1.000 | 12.000 | 1.000 |
| 50% | 0.000 | 3.000 | 248.000 | 2.000 |
| 75% | 1.000 | 5.000 | 936.000 | 3.000 |
| max | 1.000 | 251.000 | 2272.000 | 55.000 |
| lead_origin | lead_source | do_not_email | last_activity | country | specialization | curr_occupation | tags | city | avail_free_copy | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 9240 | 9204 | 9240 | 9137 | 6779 | 9240 | 6550 | 9240 | 5571 | 9240 |
| unique | 5 | 21 | 2 | 17 | 38 | 19 | 6 | 27 | 6 | 2 |
| top | Landing Page Submission | No | Email Opened | India | unknown | Unemployed | unknown | Mumbai | No | |
| freq | 4886 | 2868 | 8506 | 3437 | 6492 | 3380 | 5600 | 3353 | 3222 | 6352 |
# Impute missing categorical values with the mode, but only when the most
# frequent value in that column covers more than 50% of the rows
for i in lead_score_df.select_dtypes(include='category'):
temp = lead_score_df[i].value_counts(normalize=True, ascending=False) * 100
if temp.iloc[0] > 50:
print(i)
lead_score_df[i] = lead_score_df[i].fillna(temp.index[0])
lead_origin
do_not_email
country
curr_occupation
city
avail_free_copy
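The >50%-frequency imputation rule above can be sketched on a hypothetical categorical column: when the top category covers more than half the non-null rows, missing values are filled with it.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value; 'India' dominates.
s = pd.Series(["India", "India", "India", np.nan, "US"], dtype="category")

freq = s.value_counts(normalize=True) * 100  # NaN excluded by default
if freq.iloc[0] > 50:                        # top value covers > 50% -> impute
    s = s.fillna(freq.index[0])

print(s.tolist())  # the NaN becomes 'India'
```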
# After imputation, only a handful of columns retain missing values, each below 2%
null_pct = check_cols_null_pct(lead_score_df)
null_pct[null_pct > 0]
totalvisits      1.483
pg_view_pv       1.483
last_activity    1.115
lead_source      0.390
dtype: float64
show_stats(lead_score_df, lead_score_df.columns)
Total Nulls: 0, Mode: Landing Page Submission Unique: ['API', 'Landing Page Submission', 'Lead Add Form', 'Lead Import', 'Quick Add Form'] Categories (5, object): ['API', 'Landing Page Submission', 'Lead Add Form', 'Lead Import', 'Quick Add Form'] ValueCounts: lead_origin Landing Page Submission 52.879 API 38.745 Lead Add Form 7.771 Lead Import 0.595 Quick Add Form 0.011 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 36, Mode: Google Unique: ['Olark Chat', 'Organic Search', 'Direct Traffic', 'Google', 'Referral Sites', ..., 'welearnblog_Home', 'youtubechannel', 'testone', 'Press_Release', 'NC_EDM'] Length: 22 Categories (21, object): ['Click2call', 'Direct Traffic', 'Facebook', 'Google', ..., 'google', 'testone', 'welearnblog_Home', 'youtubechannel'] ValueCounts: lead_source Google 31.160 Direct Traffic 27.629 Olark Chat 19.068 Organic Search 12.538 Reference 5.802 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: No Unique: ['No', 'Yes'] Categories (2, object): ['No', 'Yes'] ValueCounts: do_not_email No 92.056 Yes 7.944 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.23689009604963565, Describe: count 9240.000 mean 0.385 std 0.487 min 0.000 25% 0.000 50% 0.000 75% 1.000 max 1.000 Name: converted, dtype: float64 ValueCounts: converted 0 61.461 1 38.539 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 137, Mode: 0.0 Unique: [ 0. 5. 2. 1. 4. 8. 11. 6. 3. 7. 13. 17. nan 9. 12. 10. 16. 14. 21. 15. 22. 19. 18. 20. 43. 30. 23. 55. 141. 25. 27. 29. 24. 28. 26. 74. 41. 54. 115. 251. 32. 42.] 
Median : 3.0, Variance: 23.569594711063154, Describe: count 9103.000 mean 3.445 std 4.855 min 0.000 25% 1.000 50% 3.000 75% 5.000 max 251.000 Name: totalvisits, dtype: float64 ValueCounts: totalvisits 0.000 24.047 2.000 18.455 3.000 14.347 4.000 12.304 5.000 8.602 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Median : 248.0, Variance: 300327.5275216322, Describe: count 9240.000 mean 487.698 std 548.021 min 0.000 25% 12.000 50% 248.000 75% 936.000 max 2272.000 Name: ttime_on_site, dtype: float64 ValueCounts: ttime_on_site 0 23.734 60 0.206 74 0.195 75 0.195 127 0.195 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 137, Mode: 0.0 Median : 2.0, Variance: 4.6717267097228925, Describe: count 9103.000 mean 2.363 std 2.161 min 0.000 25% 1.000 50% 2.000 75% 3.000 max 55.000 Name: pg_view_pv, dtype: float64 ValueCounts: pg_view_pv 0.000 24.047 2.000 19.719 3.000 13.139 4.000 9.843 1.000 7.151 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 103, Mode: Email Opened Unique: ['Page Visited on Website', 'Email Opened', 'Unreachable', 'Converted to Lead', 'Olark Chat Conversation', ..., 'SMS Sent', 'Visited Booth in Tradeshow', 'Resubscribed to emails', 'Email Received', 'Email Marked Spam'] Length: 18 Categories (17, object): ['Approached upfront', 'Converted to Lead', 'Email Bounced', 'Email Link Clicked', ..., 'Unreachable', 'Unsubscribed', 'View in browser link Clicked', 'Visited Booth in Tradeshow'] ValueCounts: last_activity Email Opened 37.616 SMS Sent 30.043 Olark Chat Conversation 10.649 Page Visited on Website 7.004 Converted to Lead 4.684 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: India Unique: ['India', 'Russia', 'Kuwait', 'Oman', 'United Arab Emirates', ..., 'Denmark', 'Philippines', 
'Bangladesh', 'Vietnam', 'Indonesia'] Length: 38 Categories (38, object): ['Asia/Pacific Region', 'Australia', 'Bahrain', 'Bangladesh', ..., 'United Kingdom', 'United States', 'Vietnam', 'unknown'] ValueCounts: country India 96.894 United States 0.747 United Arab Emirates 0.574 Singapore 0.260 Saudi Arabia 0.227 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: unknown Unique: ['unknown', 'Business Administration', 'Media and Advertising', 'Supply Chain Management', 'IT Projects Management', ..., 'Services Excellence', 'Hospitality Management', 'Rural and Agribusiness', 'Healthcare Management', 'E-Business'] Length: 19 Categories (19, object): ['Banking, Investment And Insurance', 'Business Administration', 'E-Business', 'E-COMMERCE', ..., 'Services Excellence', 'Supply Chain Management', 'Travel and Tourism', 'unknown'] ValueCounts: specialization unknown 36.580 Finance Management 10.563 Human Resource Management 9.177 Marketing Management 9.069 Operations Management 5.444 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Unemployed Unique: ['Unemployed', 'Student', 'Working Professional', 'Businessman', 'Other', 'Housewife'] Categories (6, object): ['Businessman', 'Housewife', 'Other', 'Student', 'Unemployed', 'Working Professional'] ValueCounts: curr_occupation Unemployed 89.719 Working Professional 7.641 Student 2.273 Other 0.173 Housewife 0.108 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: unknown Unique: ['Interested in other courses', 'Ringing', 'Will revert after reading the email', 'unknown', 'Lost to EINS', ..., 'Lateral student', 'Interested in Next batch', 'Recognition issue (DEC approval)', 'Want to take admission but has financial prob..., 'University not recognized'] Length: 27 Categories (27, object): ['Already a student', 'Busy', 
'Closed by Horizzon', 'Diploma holder (Not Eligible)', ..., 'opp hangup', 'switched off', 'unknown', 'wrong number given'] ValueCounts: tags unknown 36.288 Will revert after reading the email 22.424 Ringing 13.019 Interested in other courses 5.552 Already a student 5.032 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Mumbai Unique: ['Mumbai', 'Thane & Outskirts', 'Other Metro Cities', 'Other Cities', 'Other Cities of Maharashtra', 'Tier II Cities'] Categories (6, object): ['Mumbai', 'Other Cities', 'Other Cities of Maharashtra', 'Other Metro Cities', 'Thane & Outskirts', 'Tier II Cities'] ValueCounts: city Mumbai 74.578 Thane & Outskirts 8.139 Other Cities 7.424 Other Cities of Maharashtra 4.946 Other Metro Cities 4.113 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: No Unique: ['No', 'Yes'] Categories (2, object): ['No', 'Yes'] ValueCounts: avail_free_copy No 68.745 Yes 31.255 Name: proportion, dtype: float64 ------------------------------------------------------------------
----------------------------------------------------------------------¶
3. Data Visualization - EDA¶
# univariate plots
# Perform univariate analysis, starting with the numerical (continuous) variables
dtype_dict = classify_feature_dtype(lead_score_df, lead_score_df.columns )
univariate_plots(lead_score_df, dtype_dict['float_ts'], ftype='non_categorical', target='converted')
Observations¶
- The converted column has more zeros (about 61% not converted) than ones (about 39% converted)
- The box plot for totalvisits shows outliers; the majority of the values fall between 0 and 30
- The box plot for pg_view_pv also shows outliers; the majority of the values fall between 0 and 20
# univariate plots
cols = dtype_dict['int_cat'].copy()
cols.remove('converted')
univariate_plots(lead_score_df, cols, ftype='categorical', target='converted')
Observations¶
The categorical analysis¶
- More than 50% of the leads originate from the Landing Page Submission; the API origin follows with about 38% of leads
- By lead origin, the Lead Add Form has a noticeably higher conversion rate, while the other origins convert at broadly similar rates
- By lead source, about 31% of leads come from Google, 28% from Direct Traffic and 19% from Olark Chat
- For the do_not_email feature, about 92% chose No and 8% chose Yes, so the majority remain open to contact from the edtech platform
- The users who said No also show a higher conversion rate
- The last activity for about 38% of users is Email Opened, followed by SMS Sent, so most users are active on email conversations
- The majority of users are from India, which suggests higher popularity of, and demand for, the platform in India
- Among users with a known specialization, Finance Management is the most common, followed by Human Resource Management and Marketing Management; conversion rates are similar across specializations
- The majority of users are unemployed, followed by working professionals; working professionals convert at a higher rate even though their count is much lower
- The majority of users chose Better Career Prospects as the reason for opting for the course
- Most leads are tagged "Will revert after reading the email"
- The largest number of leads comes from Mumbai, followed by Thane & Outskirts; conversion probabilities are similar across cities
- About 69% of the leads declined the free copy of Mastering The Interview, while 31% opted in
# Bivariate plots
# The pair plot shows a slight correlation between totalvisits and pg_view_pv
sns.pairplot(lead_score_df)
plt.show();
# multivariate plots
# The heatmap shows that most feature pairs have little or no correlation (|r| < 0.5)
plt.figure(figsize = (20, 4)) # Size of the figure
sns.heatmap(lead_score_df.select_dtypes(exclude='category').corr(), annot = True)
plt.show();
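Rather than reading correlations off the heatmap, the strongly correlated pairs can be listed programmatically. A minimal sketch with a small hypothetical frame (the column names mirror the notebook's, but the demo values are made up):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.5) -> list:
    """Return (col_a, col_b, r) for numeric column pairs with |r| above threshold."""
    corr = df.select_dtypes(include=np.number).corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

demo = pd.DataFrame({
    'totalvisits':   [1, 2, 3, 4, 5],
    'pg_view_pv':    [1.1, 2.0, 2.9, 4.2, 5.1],   # nearly collinear with totalvisits
    'ttime_on_site': [120, 40, 300, 10, 250],
})
print(high_corr_pairs(demo, threshold=0.5))  # only the near-collinear pair is flagged
```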
3.1 Other Bivariate - Multivariate plots (computationally intensive)¶
# Boxplots of numerical features against the target
axs = 141
plt.figure(figsize=(26, 6))
for i in list(set(dtype_dict['float_ts']) - set(['date','converted','lead_number'])):
plt.subplot(axs)
sns.boxplot(y=i, x='converted', data=lead_score_df, palette='tab10')
axs += 1
plt.show();
# Generate bivariate boxplot combinations for all categorical vs continuous columns
x_lst = list(dtype_dict['int_cat']) # x_list variable contains all the categorical Columns
y_lst = list(dtype_dict['float_ts'])# y_list contains all Continuous feature type columns
axs = 1
for x_col in x_lst:
plt.figure(figsize=(26,72))
for y_col in y_lst:
plt.subplot(18,4,axs)
sns.boxplot(x=x_col, y=y_col, data=lead_score_df, palette='tab10')
plt.xticks(rotation=90)
axs += 1
plt.show();
axs = 1
plt.show();
# # Generate Bivariate Barplots combinations for all the categorical features
# matplotlib.rcParams['ytick.labelsize'] = 8
# matplotlib.rcParams['xtick.labelsize'] = 8
# y_lst = x_lst = list(set(dtype_dict['int_cat']) - set(['converted','lead_number'])) # x_list variable contains all the categorical Columns
# axs = 1
# for x_col in x_lst:
# plt.figure(figsize=(25,400))
# for y_col in y_lst:
# # x_f = lead_score_df[x_col]
# # y_f = lead_score_df[y_col]
# # if isinstance(x_f,(object,'category')):
# # x_f = x_f.astype('str',copy=True).str.slice(0,10)
# # if isinstance(y_f,(object,'category')):
# # y_f = y_f.astype('str',copy=True).str.slice(0,10)
# plt.subplot(60,4,axs)
# sns.barplot(x=x_col, y=y_col, hue ='converted' ,data=lead_score_df, palette='tab10')
# plt.xticks(rotation=90)
# # plt.ylabel(ylabel= lead_score_df[y_col].name, labelpad=-50)
# # plt.locator_params(tight=True)
# axs += 1
# plt.show();
# axs = 1
# plt.show();
# # Multivariate Boxplots -
# matplotlib.rcParams['ytick.labelsize'] = 8
# matplotlib.rcParams['xtick.labelsize'] = 8
# # matplotlib.rcParams['ytick.major.size'] = 2
# # matplotlib.rcParams['xtick.major.size'] = 2
# z_lst = x_lst = list(dtype_dict['int_cat'])
# y_lst = list(dtype_dict['float_ts'])
# axs = 1
# for x_col in x_lst:
# for y_col in y_lst:
# plt.figure(figsize=(25, 200))
# for z_col in z_lst:
# # x_f = lead_score_df[x_col]
# # y_f = lead_score_df[y_col]
# # if isinstance(x_f,(object,'category')):
# # x_f = x_f.astype('str',copy=True).str.slice(0,10)
# # if isinstance(y_f,(object,'category')):
# # y_f = y_f.astype('str',copy=True).str.slice(0,10)
# plt.subplot(25, 2, axs)
# sns.boxplot(x=x_col, y=y_col, hue=z_col,hue_order=lead_score_df[z_col].value_counts().head(10).index ,data=lead_score_df, palette='tab10')
# plt.xticks(rotation=90)
# # plt.locator_params(tight=True)
# plt.legend(loc='best')
# axs += 1
# plt.show();
# axs = 1
# print("--------------------------------------------------------------------------")
# # break
# plt.show();
# # Multivariate Boxplots -
# matplotlib.rcParams['ytick.labelsize'] = 8
# matplotlib.rcParams['xtick.labelsize'] = 8
# # matplotlib.rcParams['ytick.major.size'] = 2
# # matplotlib.rcParams['xtick.major.size'] = 2
# x_lst = list(dtype_dict['int_cat'])
# x_lst.remove('lead_origin')
# x_lst.remove('converted')
# z_lst = y_lst= x_lst
# axs = 1
# for x_col in x_lst:
# for y_col in y_lst:
# plt.figure(figsize=(25, 500))
# for z_col in z_lst:
# if (x_col != y_col) & (y_col != z_col) & (x_col != z_col):
# # x_f = lead_score_df[x_col]
# # y_f = lead_score_df[y_col]
# # if isinstance(x_f,(object,'category')):
# # x_f = x_f.astype('str',copy=True).str.slice(0,10)
# # if isinstance(y_f,(object,'category')):
# # y_f = y_f.astype('str',copy=True).str.slice(0,10)
# plt.subplot(100, 3, axs)
# sns.barplot(x=x_col, y=y_col, hue=z_col, hue_order=lead_score_df[z_col].value_counts().head(10).index, data=lead_score_df, palette='tab10')
# plt.xticks(rotation=90)
# # plt.ylabel(labelpad=3)
# # plt.locator_params.set_label_coords(-0.01, 0.5)
# plt.legend(loc='best')
# axs += 1
# plt.show()
# axs = 1
# print("--------------------------------------------------------------------------")
# plt.show();
----------------------------------------------------------------------¶
4. Data Preprocesing - Part 2¶
Outlier Analysis and Capping¶
# EDA identified outliers in the pg_view_pv, totalvisits and ttime_on_site columns; we therefore cap all values to the lower and upper cutoffs of the IQR range
ex_val_df = get_extremeval_threshld(df=lead_score_df.select_dtypes(exclude=['category','object']) )
ex_val_df
lead_score_df.describe(percentiles=[.05,.1,.2,.5,.8,.9])
| thresh_low | thresh_high | |
|---|---|---|
| name | ||
| converted | -1.500 | 2.500 |
| totalvisits | -5.000 | 11.000 |
| ttime_on_site | -1374.000 | 2322.000 |
| pg_view_pv | -2.000 | 6.000 |
| converted | totalvisits | ttime_on_site | pg_view_pv | |
|---|---|---|---|---|
| count | 9240.000 | 9103.000 | 9240.000 | 9103.000 |
| mean | 0.385 | 3.445 | 487.698 | 2.363 |
| std | 0.487 | 4.855 | 548.021 | 2.161 |
| min | 0.000 | 0.000 | 0.000 | 0.000 |
| 5% | 0.000 | 0.000 | 0.000 | 0.000 |
| 10% | 0.000 | 0.000 | 0.000 | 0.000 |
| 20% | 0.000 | 0.000 | 0.000 | 0.000 |
| 50% | 0.000 | 3.000 | 248.000 | 2.000 |
| 80% | 1.000 | 5.000 | 1087.200 | 4.000 |
| 90% | 1.000 | 7.000 | 1380.000 | 5.000 |
| max | 1.000 | 251.000 | 2272.000 | 55.000 |
# Cap outliers: values below thresh_low are set to thresh_low, values above thresh_high to thresh_high
lower_cutoff = ex_val_df.loc['pg_view_pv','thresh_low']
lead_score_df['pg_view_pv'] = np.where((lead_score_df['pg_view_pv'] < lower_cutoff), lower_cutoff, lead_score_df['pg_view_pv'])
upper_cutoff = ex_val_df.loc['pg_view_pv','thresh_high']
lead_score_df['pg_view_pv'] = np.where((lead_score_df['pg_view_pv'] > upper_cutoff), upper_cutoff, lead_score_df['pg_view_pv'])
# Cap outliers: values below thresh_low are set to thresh_low, values above thresh_high to thresh_high
lower_cutoff = ex_val_df.loc['totalvisits','thresh_low']
lead_score_df['totalvisits'] = np.where((lead_score_df['totalvisits'] < lower_cutoff), lower_cutoff, lead_score_df['totalvisits'])
upper_cutoff = ex_val_df.loc['totalvisits','thresh_high']
lead_score_df['totalvisits'] = np.where((lead_score_df['totalvisits'] > upper_cutoff), upper_cutoff, lead_score_df['totalvisits'])
# Cap outliers: values below thresh_low are set to thresh_low, values above thresh_high to thresh_high
lower_cutoff = ex_val_df.loc['ttime_on_site','thresh_low']
lead_score_df['ttime_on_site'] = np.where((lead_score_df['ttime_on_site'] < lower_cutoff), lower_cutoff, lead_score_df['ttime_on_site'])
upper_cutoff = ex_val_df.loc['ttime_on_site','thresh_high']
lead_score_df['ttime_on_site'] = np.where((lead_score_df['ttime_on_site'] > upper_cutoff), upper_cutoff, lead_score_df['ttime_on_site'])
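The three capping cells above follow one pattern, so they could be collapsed into a helper. A minimal sketch that computes the IQR thresholds inline (the notebook's `get_extremeval_threshld` helper is not shown here, so its exact multiplier is assumed to be the conventional 1.5):

```python
import pandas as pd

def cap_iqr_outliers(df: pd.DataFrame, cols: list, k: float = 1.5) -> pd.DataFrame:
    """Clip each column to [Q1 - k*IQR, Q3 + k*IQR]."""
    df = df.copy()
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return df

demo = pd.DataFrame({'totalvisits': [0, 1, 3, 5, 251]})  # 251 is an extreme value
capped = cap_iqr_outliers(demo, ['totalvisits'])
print(capped['totalvisits'].max())  # 251 is capped to Q3 + 1.5*IQR = 11.0
```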
lead_score_df.describe()
| converted | totalvisits | ttime_on_site | pg_view_pv | |
|---|---|---|---|---|
| count | 9240.000 | 9103.000 | 9240.000 | 9103.000 |
| mean | 0.385 | 3.221 | 487.698 | 2.259 |
| std | 0.487 | 2.882 | 548.021 | 1.793 |
| min | 0.000 | 0.000 | 0.000 | 0.000 |
| 25% | 0.000 | 1.000 | 12.000 | 1.000 |
| 50% | 0.000 | 3.000 | 248.000 | 2.000 |
| 75% | 1.000 | 5.000 | 936.000 | 3.000 |
| max | 1.000 | 11.000 | 2272.000 | 6.000 |
# Additionally, there are a few null values in the totalvisits and pg_view_pv features; we impute them with the median of each column
lead_score_df['totalvisits'] = lead_score_df['totalvisits'].replace(to_replace=np.nan, value=lead_score_df['totalvisits'].median())
lead_score_df['pg_view_pv'] = lead_score_df['pg_view_pv'].replace(to_replace=np.nan, value=lead_score_df['pg_view_pv'].median())
# our null values have significantly reduced
null_pct = check_cols_null_pct(lead_score_df)
null_pct[null_pct>0]
lead_score_df.shape
last_activity 1.115 lead_source 0.390 dtype: float64
(9240, 14)
lead_score_df = lead_score_df.dropna(subset=['last_activity','lead_source'])
# lead_score_df.to_csv('spec_tag_analysis.csv')
lead_score_df = lead_score_df.drop(['country'], axis=1)
null_pct = check_cols_null_pct(lead_score_df)
null_pct[null_pct>0] ### there are no null values
lead_score_df.shape
Series([], dtype: float64)
(9103, 13)
lead_score_df.dtypes
lead_origin category lead_source category do_not_email category converted int64 totalvisits float64 ttime_on_site float64 pg_view_pv float64 last_activity category specialization category curr_occupation category tags category city category avail_free_copy category dtype: object
not lead_score_df['totalvisits'].dtype == np.float64
False
dtype_dict = classify_feature_dtype(lead_score_df, lead_score_df.columns )
univariate_plots(lead_score_df, dtype_dict['float_ts'], ftype='non_categorical', target='converted')
cols = dtype_dict['int_cat'].copy()
cols.remove('converted')
univariate_plots(lead_score_df, cols, ftype='categorical', target='converted')
# replace Yes, No with 1 and 0
lead_score_df = lead_score_df.replace(to_replace=['Yes', 'No'], value=[1, 0])
----------------------------------------------------------------------¶
5. Data Imbalance & Conversion Ratio¶
# Data Imbalance
# The ratio of converted to non-converted leads is around 61%, which is not severely imbalanced, so we decide not to rebalance
imbalance_ratio = sum(lead_score_df['converted'] == 1)/sum(lead_score_df['converted'] == 0) * 100
print(f'{round(imbalance_ratio, 2)}%')
61.09%
# Conversion Ratio
# The conversion ratio is around 38%, i.e. the majority of leads do not convert
converted = (sum(lead_score_df['converted'])/len(lead_score_df['converted'].index))*100
print(f'{round(converted, 2)}%')
37.92%
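The same two figures can also be read straight from `value_counts`. A small sketch on a hypothetical target series with the same 38/62 split:

```python
import pandas as pd

converted = pd.Series([1] * 38 + [0] * 62)  # hypothetical target, 38% positives

# Share of each class in the target (conversion ratio = share of 1s)
class_share = converted.value_counts(normalize=True) * 100
print(class_share.to_dict())  # {0: 62.0, 1: 38.0}

# Ratio of positives to negatives (the notebook's "imbalance ratio")
imbalance_ratio = (converted == 1).sum() / (converted == 0).sum() * 100
print(round(imbalance_ratio, 2))  # 61.29
```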
-----------------------------------------------------¶
# errorline # do not remove this line
-----------------------------------------------------¶
Approach - 01¶
Data Encoding¶
Dummy Encoding¶
# We perform dummy encoding on all category columns except tags and specialization, which are encoded separately below
new_ls_df = pd.get_dummies(lead_score_df, columns=lead_score_df.select_dtypes('category').columns.difference(['tags','specialization']), drop_first=True, dtype=float)
new_ls_df.head(1)
new_ls_df.shape
| converted | totalvisits | ttime_on_site | pg_view_pv | specialization | tags | avail_free_copy_1 | city_Other Cities | city_Other Cities of Maharashtra | city_Other Metro Cities | city_Thane & Outskirts | city_Tier II Cities | curr_occupation_Housewife | curr_occupation_Other | curr_occupation_Student | curr_occupation_Unemployed | curr_occupation_Working Professional | do_not_email_1 | last_activity_Converted to Lead | last_activity_Email Bounced | last_activity_Email Link Clicked | last_activity_Email Marked Spam | last_activity_Email Opened | last_activity_Email Received | last_activity_Form Submitted on Website | last_activity_Had a Phone Conversation | last_activity_Olark Chat Conversation | last_activity_Page Visited on Website | last_activity_Resubscribed to emails | last_activity_SMS Sent | last_activity_Unreachable | last_activity_Unsubscribed | last_activity_View in browser link Clicked | last_activity_Visited Booth in Tradeshow | lead_origin_Landing Page Submission | lead_origin_Lead Add Form | lead_origin_Lead Import | lead_origin_Quick Add Form | lead_source_Direct Traffic | lead_source_Facebook | lead_source_Google | lead_source_Live Chat | lead_source_NC_EDM | lead_source_Olark Chat | lead_source_Organic Search | lead_source_Pay per Click Ads | lead_source_Press_Release | lead_source_Reference | lead_source_Referral Sites | lead_source_Social Media | lead_source_WeLearn | lead_source_Welingak Website | lead_source_bing | lead_source_blog | lead_source_google | lead_source_testone | lead_source_welearnblog_Home | lead_source_youtubechannel | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.000 | 0.000 | 0.000 | unknown | Interested in other courses | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
(9103, 58)
new_ls_df = pd.get_dummies(new_ls_df, columns=['tags','specialization'], dtype=float)
new_ls_df = new_ls_df.drop(['tags_unknown','specialization_unknown'], axis=1)
new_ls_df.shape
(9103, 100)
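To illustrate what the cells above do, here is a minimal `pd.get_dummies` sketch on a hypothetical single-column frame. With `drop_first=True` one level per feature is dropped (the dummy-variable trap), which is why the first encoding pass produces k-1 columns per category:

```python
import pandas as pd

demo = pd.DataFrame({'city': ['Mumbai', 'Thane', 'Mumbai', 'Pune']})

# drop_first=True drops the alphabetically first level ('Mumbai') per feature
dummies = pd.get_dummies(demo, columns=['city'], drop_first=True, dtype=float)
print(list(dummies.columns))  # ['city_Pune', 'city_Thane']
```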
----------------------------------------------------------------------¶
Train and Test Split¶
X = new_ls_df.drop(['converted'], axis=1)
y = new_ls_df['converted']
# Now we split the dataset into train and test set
np.random.seed(0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
----------------------------------------------------------------------¶
Feature Scaling¶
new_ls_df.dtypes
converted int64 totalvisits float64 ttime_on_site float64 pg_view_pv float64 avail_free_copy_1 float64 city_Other Cities float64 city_Other Cities of Maharashtra float64 city_Other Metro Cities float64 city_Thane & Outskirts float64 city_Tier II Cities float64 curr_occupation_Housewife float64 curr_occupation_Other float64 curr_occupation_Student float64 curr_occupation_Unemployed float64 curr_occupation_Working Professional float64 do_not_email_1 float64 last_activity_Converted to Lead float64 last_activity_Email Bounced float64 last_activity_Email Link Clicked float64 last_activity_Email Marked Spam float64 last_activity_Email Opened float64 last_activity_Email Received float64 last_activity_Form Submitted on Website float64 last_activity_Had a Phone Conversation float64 last_activity_Olark Chat Conversation float64 last_activity_Page Visited on Website float64 last_activity_Resubscribed to emails float64 last_activity_SMS Sent float64 last_activity_Unreachable float64 last_activity_Unsubscribed float64 last_activity_View in browser link Clicked float64 last_activity_Visited Booth in Tradeshow float64 lead_origin_Landing Page Submission float64 lead_origin_Lead Add Form float64 lead_origin_Lead Import float64 lead_origin_Quick Add Form float64 lead_source_Direct Traffic float64 lead_source_Facebook float64 lead_source_Google float64 lead_source_Live Chat float64 lead_source_NC_EDM float64 lead_source_Olark Chat float64 lead_source_Organic Search float64 lead_source_Pay per Click Ads float64 lead_source_Press_Release float64 lead_source_Reference float64 lead_source_Referral Sites float64 lead_source_Social Media float64 lead_source_WeLearn float64 lead_source_Welingak Website float64 lead_source_bing float64 lead_source_blog float64 lead_source_google float64 lead_source_testone float64 lead_source_welearnblog_Home float64 lead_source_youtubechannel float64 tags_Already a student float64 tags_Busy float64 tags_Closed by Horizzon float64 tags_Diploma holder (Not 
Eligible) float64 tags_Graduation in progress float64 tags_In confusion whether part time or DLP float64 tags_Interested in full time MBA float64 tags_Interested in Next batch float64 tags_Interested in other courses float64 tags_Lateral student float64 tags_Lost to EINS float64 tags_Lost to Others float64 tags_Not doing further education float64 tags_Recognition issue (DEC approval) float64 tags_Ringing float64 tags_Shall take in the next coming month float64 tags_Still Thinking float64 tags_University not recognized float64 tags_Want to take admission but has financial problems float64 tags_Will revert after reading the email float64 tags_in touch with EINS float64 tags_invalid number float64 tags_number not provided float64 tags_opp hangup float64 tags_switched off float64 tags_wrong number given float64 specialization_Banking, Investment And Insurance float64 specialization_Business Administration float64 specialization_E-Business float64 specialization_E-COMMERCE float64 specialization_Finance Management float64 specialization_Healthcare Management float64 specialization_Hospitality Management float64 specialization_Human Resource Management float64 specialization_IT Projects Management float64 specialization_International Business float64 specialization_Marketing Management float64 specialization_Media and Advertising float64 specialization_Operations Management float64 specialization_Retail Management float64 specialization_Rural and Agribusiness float64 specialization_Services Excellence float64 specialization_Supply Chain Management float64 specialization_Travel and Tourism float64 dtype: object
# After the split we perform standard scaling: fit and transform on the training set only, so that the test set can later be transformed with the same fitted scaler
# to_scale = ['lead_number', 'totalvisits', 'ttime_on_site', 'pg_view_pv']
to_scale = list(X.columns)
scaler = StandardScaler()
X_train[to_scale] = scaler.fit_transform(X_train[to_scale], y_train)
X_train.head(5)
| totalvisits | ttime_on_site | pg_view_pv | avail_free_copy_1 | city_Other Cities | city_Other Cities of Maharashtra | city_Other Metro Cities | city_Thane & Outskirts | city_Tier II Cities | curr_occupation_Housewife | curr_occupation_Other | curr_occupation_Student | curr_occupation_Unemployed | curr_occupation_Working Professional | do_not_email_1 | last_activity_Converted to Lead | last_activity_Email Bounced | last_activity_Email Link Clicked | last_activity_Email Marked Spam | last_activity_Email Opened | last_activity_Email Received | last_activity_Form Submitted on Website | last_activity_Had a Phone Conversation | last_activity_Olark Chat Conversation | last_activity_Page Visited on Website | last_activity_Resubscribed to emails | last_activity_SMS Sent | last_activity_Unreachable | last_activity_Unsubscribed | last_activity_View in browser link Clicked | last_activity_Visited Booth in Tradeshow | lead_origin_Landing Page Submission | lead_origin_Lead Add Form | lead_origin_Lead Import | lead_origin_Quick Add Form | lead_source_Direct Traffic | lead_source_Facebook | lead_source_Google | lead_source_Live Chat | lead_source_NC_EDM | lead_source_Olark Chat | lead_source_Organic Search | lead_source_Pay per Click Ads | lead_source_Press_Release | lead_source_Reference | lead_source_Referral Sites | lead_source_Social Media | lead_source_WeLearn | lead_source_Welingak Website | lead_source_bing | lead_source_blog | lead_source_google | lead_source_testone | lead_source_welearnblog_Home | lead_source_youtubechannel | tags_Already a student | tags_Busy | tags_Closed by Horizzon | tags_Diploma holder (Not Eligible) | tags_Graduation in progress | tags_In confusion whether part time or DLP | tags_Interested in full time MBA | tags_Interested in Next batch | tags_Interested in other courses | tags_Lateral student | tags_Lost to EINS | tags_Lost to Others | tags_Not doing further education | tags_Recognition issue (DEC approval) | tags_Ringing | tags_Shall take in 
the next coming month | tags_Still Thinking | tags_University not recognized | tags_Want to take admission but has financial problems | tags_Will revert after reading the email | tags_in touch with EINS | tags_invalid number | tags_number not provided | tags_opp hangup | tags_switched off | tags_wrong number given | specialization_Banking, Investment And Insurance | specialization_Business Administration | specialization_E-Business | specialization_E-COMMERCE | specialization_Finance Management | specialization_Healthcare Management | specialization_Hospitality Management | specialization_Human Resource Management | specialization_IT Projects Management | specialization_International Business | specialization_Marketing Management | specialization_Media and Advertising | specialization_Operations Management | specialization_Retail Management | specialization_Rural and Agribusiness | specialization_Services Excellence | specialization_Supply Chain Management | specialization_Travel and Tourism | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7740 | -0.779 | -0.376 | -0.712 | 1.471 | -0.284 | 4.347 | -0.208 | -0.293 | -0.086 | -0.029 | -0.042 | -0.152 | 0.335 | -0.284 | -0.290 | 4.477 | -0.192 | -0.176 | -0.017 | -0.781 | -0.017 | -0.122 | -0.054 | -0.345 | -0.275 | -0.012 | -0.647 | -0.101 | -0.078 | -0.029 | -0.012 | 0.925 | -0.265 | -0.063 | 0.000 | 1.619 | -0.064 | -0.687 | -0.017 | -0.012 | -0.485 | -0.381 | -0.012 | -0.017 | -0.231 | -0.120 | -0.017 | -0.012 | -0.119 | -0.017 | -0.012 | -0.023 | -0.012 | -0.012 | -0.012 | 4.372 | -0.141 | -0.195 | -0.086 | -0.114 | -0.023 | -0.112 | -0.020 | -0.245 | -0.020 | -0.138 | -0.029 | -0.127 | -0.012 | -0.389 | -0.012 | -0.023 | -0.017 | -0.020 | -0.532 | -0.035 | -0.091 | -0.050 | -0.062 | -0.164 | -0.072 | -0.198 | -0.214 | -0.079 | -0.113 | -0.345 | -0.133 | -0.111 | 3.113 | -0.208 | -0.140 | -0.318 | -0.150 | -0.232 | -0.105 | -0.088 | -0.063 | -0.199 | -0.150 |
| 7014 | 0.971 | 0.881 | -0.150 | -0.680 | -0.284 | -0.230 | -0.208 | -0.293 | -0.086 | -0.029 | -0.042 | -0.152 | 0.335 | -0.284 | -0.290 | -0.223 | -0.192 | -0.176 | -0.017 | 1.280 | -0.017 | -0.122 | -0.054 | -0.345 | -0.275 | -0.012 | -0.647 | -0.101 | -0.078 | -0.029 | -0.012 | 0.925 | -0.265 | -0.063 | 0.000 | -0.618 | -0.064 | 1.456 | -0.017 | -0.012 | -0.485 | -0.381 | -0.012 | -0.017 | -0.231 | -0.120 | -0.017 | -0.012 | -0.119 | -0.017 | -0.012 | -0.023 | -0.012 | -0.012 | -0.012 | -0.229 | -0.141 | -0.195 | -0.086 | -0.114 | -0.023 | -0.112 | -0.020 | -0.245 | -0.020 | -0.138 | -0.029 | -0.127 | -0.012 | -0.389 | -0.012 | -0.023 | -0.017 | -0.020 | 1.881 | -0.035 | -0.091 | -0.050 | -0.062 | -0.164 | -0.072 | -0.198 | -0.214 | -0.079 | -0.113 | 2.900 | -0.133 | -0.111 | -0.321 | -0.208 | -0.140 | -0.318 | -0.150 | -0.232 | -0.105 | -0.088 | -0.063 | -0.199 | -0.150 |
| 2431 | -0.429 | -0.337 | -0.150 | -0.680 | -0.284 | -0.230 | -0.208 | -0.293 | -0.086 | -0.029 | -0.042 | -0.152 | 0.335 | -0.284 | -0.290 | -0.223 | -0.192 | -0.176 | -0.017 | -0.781 | -0.017 | 8.189 | -0.054 | -0.345 | -0.275 | -0.012 | -0.647 | -0.101 | -0.078 | -0.029 | -0.012 | 0.925 | -0.265 | -0.063 | 0.000 | -0.618 | -0.064 | 1.456 | -0.017 | -0.012 | -0.485 | -0.381 | -0.012 | -0.017 | -0.231 | -0.120 | -0.017 | -0.012 | -0.119 | -0.017 | -0.012 | -0.023 | -0.012 | -0.012 | -0.012 | -0.229 | -0.141 | -0.195 | -0.086 | -0.114 | -0.023 | -0.112 | -0.020 | -0.245 | -0.020 | -0.138 | -0.029 | 7.894 | -0.012 | -0.389 | -0.012 | -0.023 | -0.017 | -0.020 | -0.532 | -0.035 | -0.091 | -0.050 | -0.062 | -0.164 | -0.072 | -0.198 | -0.214 | -0.079 | -0.113 | -0.345 | -0.133 | -0.111 | -0.321 | 4.808 | -0.140 | -0.318 | -0.150 | -0.232 | -0.105 | -0.088 | -0.063 | -0.199 | -0.150 |
| 3012 | 1.672 | 1.539 | 0.973 | -0.680 | -0.284 | -0.230 | -0.208 | -0.293 | -0.086 | -0.029 | -0.042 | -0.152 | 0.335 | -0.284 | -0.290 | -0.223 | -0.192 | -0.176 | -0.017 | 1.280 | -0.017 | -0.122 | -0.054 | -0.345 | -0.275 | -0.012 | -0.647 | -0.101 | -0.078 | -0.029 | -0.012 | -1.081 | -0.265 | -0.063 | 0.000 | -0.618 | -0.064 | -0.687 | -0.017 | -0.012 | -0.485 | 2.623 | -0.012 | -0.017 | -0.231 | -0.120 | -0.017 | -0.012 | -0.119 | -0.017 | -0.012 | -0.023 | -0.012 | -0.012 | -0.012 | -0.229 | -0.141 | -0.195 | -0.086 | -0.114 | -0.023 | -0.112 | -0.020 | -0.245 | -0.020 | -0.138 | -0.029 | -0.127 | -0.012 | -0.389 | -0.012 | -0.023 | -0.017 | -0.020 | 1.881 | -0.035 | -0.091 | -0.050 | -0.062 | -0.164 | -0.072 | -0.198 | -0.214 | -0.079 | -0.113 | -0.345 | 7.506 | -0.111 | -0.321 | -0.208 | -0.140 | -0.318 | -0.150 | -0.232 | -0.105 | -0.088 | -0.063 | -0.199 | -0.150 |
| 3140 | -1.130 | -0.889 | -1.274 | -0.680 | -0.284 | -0.230 | -0.208 | -0.293 | -0.086 | -0.029 | -0.042 | -0.152 | 0.335 | -0.284 | -0.290 | -0.223 | -0.192 | -0.176 | -0.017 | 1.280 | -0.017 | -0.122 | -0.054 | -0.345 | -0.275 | -0.012 | -0.647 | -0.101 | -0.078 | -0.029 | -0.012 | -1.081 | -0.265 | -0.063 | 0.000 | -0.618 | -0.064 | -0.687 | -0.017 | -0.012 | 2.062 | -0.381 | -0.012 | -0.017 | -0.231 | -0.120 | -0.017 | -0.012 | -0.119 | -0.017 | -0.012 | -0.023 | -0.012 | -0.012 | -0.012 | -0.229 | -0.141 | -0.195 | -0.086 | -0.114 | -0.023 | -0.112 | -0.020 | -0.245 | -0.020 | -0.138 | -0.029 | -0.127 | -0.012 | -0.389 | -0.012 | -0.023 | -0.017 | -0.020 | -0.532 | -0.035 | -0.091 | -0.050 | -0.062 | -0.164 | -0.072 | -0.198 | -0.214 | -0.079 | -0.113 | -0.345 | -0.133 | -0.111 | -0.321 | -0.208 | -0.140 | -0.318 | -0.150 | -0.232 | -0.105 | -0.088 | -0.063 | -0.199 | -0.150 |
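The key point in the cell above is that the scaler is fitted on the training set only; the test set must be transformed with the train-set mean and standard deviation, never refitted. A minimal sketch with hypothetical values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[0.0], [2.0], [4.0]])  # hypothetical train column
X_te = np.array([[2.0], [6.0]])         # hypothetical test column

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)  # fit on train only: mean=2, std≈1.633
X_te_s = scaler.transform(X_te)      # reuse the train statistics on test

print(X_tr_s.ravel())  # approximately [-1.225, 0.0, 1.225]
print(X_te_s.ravel())  # approximately [0.0, 2.449] -- 6 lies beyond the train range
```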
----------------------------------------------------------------------¶
Model Building¶
Custom Functions for Model Training¶
# We create custom functions for model building since we reuse them across many training iterations
# logreg_train_pred_fn trains the model, predicts on the same data and returns the fitted model, the predicted probabilities and the labels derived from the cutoff
# logreg_metrics_fn returns the confusion matrix and the accuracy score
# logreg_VIF_score_fn returns the VIF score for each feature
def logreg_train_pred_fn(fX_train, fy_train, fcol, fcutoff):
fX_train_sm = sm.add_constant(fX_train[fcol])
flogm = sm.GLM(fy_train, fX_train_sm, family = sm.families.Binomial())
fres = flogm.fit()
fy_train_pred = fres.predict(fX_train_sm)
fy_train_pred = fy_train_pred.values.reshape(-1)
fy_train_pred_final = pd.DataFrame({'Converted':fy_train.values, 'Conv_Prob':fy_train_pred})
fy_train_pred_final['ID'] = fy_train.index
fy_train_pred_final['predicted'] = fy_train_pred_final.Conv_Prob.map(lambda x: 1 if x > fcutoff else 0)
return fres, fy_train_pred,fy_train_pred_final
def logreg_metrics_fn(fy_train_pred_final):
fconfusion = confusion_matrix(fy_train_pred_final.Converted, fy_train_pred_final.predicted )
faccuracy = accuracy_score(fy_train_pred_final.Converted, fy_train_pred_final.predicted)
return fconfusion, faccuracy
def logreg_VIF_score_fn(fX_train, fcol):
fvif = pd.DataFrame()
fvif['Features'] = fX_train[fcol].columns
fvif['VIF'] = [variance_inflation_factor(fX_train[fcol].values, i) for i in range(fX_train[fcol].shape[1])]
fvif['VIF'] = round(fvif['VIF'], 2)
fvif = fvif.sort_values(by = "VIF", ascending = False)
return fvif
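The cutoff step inside `logreg_train_pred_fn` is worth isolating: predicted probabilities become class labels only once a threshold is applied. A minimal sketch with hypothetical probabilities and the default 0.5 cutoff:

```python
import pandas as pd

probs = pd.Series([0.12, 0.48, 0.51, 0.93])  # hypothetical conversion probabilities
cutoff = 0.5

# Same mapping logreg_train_pred_fn applies: 1 if the probability exceeds the cutoff
predicted = probs.map(lambda p: 1 if p > cutoff else 0)
print(predicted.tolist())  # [0, 0, 1, 1]
```

Lowering the cutoff trades precision for recall, which is why later model iterations typically revisit this threshold.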
Base Model¶
# Logistic regression model
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
res = logm1.fit()
res.summary()
| Dep. Variable: | converted | No. Observations: | 7282 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 7183 |
| Model Family: | Binomial | Df Model: | 98 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | nan |
| Date: | Sun, 20 Oct 2024 | Deviance: | 69906. |
| Time: | 23:04:16 | Pearson chi2: | 3.42e+18 |
| No. Iterations: | 100 | Pseudo R-squ. (CS): | nan |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -2.981e+14 | 7.86e+05 | -3.79e+08 | 0.000 | -2.98e+14 | -2.98e+14 |
| totalvisits | 1.437e+14 | 1.3e+06 | 1.11e+08 | 0.000 | 1.44e+14 | 1.44e+14 |
| ttime_on_site | 9.087e+13 | 9.56e+05 | 9.51e+07 | 0.000 | 9.09e+13 | 9.09e+13 |
| pg_view_pv | -1.353e+14 | 1.44e+06 | -9.37e+07 | 0.000 | -1.35e+14 | -1.35e+14 |
| avail_free_copy_1 | -5.638e+13 | 1.18e+06 | -4.8e+07 | 0.000 | -5.64e+13 | -5.64e+13 |
| city_Other Cities | 1.495e+13 | 8.39e+05 | 1.78e+07 | 0.000 | 1.49e+13 | 1.49e+13 |
| city_Other Cities of Maharashtra | 6.152e+12 | 8.29e+05 | 7.42e+06 | 0.000 | 6.15e+12 | 6.15e+12 |
| city_Other Metro Cities | 6.251e+12 | 8.28e+05 | 7.55e+06 | 0.000 | 6.25e+12 | 6.25e+12 |
| city_Thane & Outskirts | -8.958e+12 | 8.34e+05 | -1.07e+07 | 0.000 | -8.96e+12 | -8.96e+12 |
| city_Tier II Cities | 2.437e+13 | 8e+05 | 3.05e+07 | 0.000 | 2.44e+13 | 2.44e+13 |
| curr_occupation_Housewife | 1.005e+14 | 1.12e+06 | 8.99e+07 | 0.000 | 1.01e+14 | 1.01e+14 |
| curr_occupation_Other | -5.502e+11 | 1.41e+06 | -3.91e+05 | 0.000 | -5.5e+11 | -5.5e+11 |
| curr_occupation_Student | 4.004e+13 | 4.18e+06 | 9.58e+06 | 0.000 | 4e+13 | 4e+13 |
| curr_occupation_Unemployed | 6.388e+13 | 8.31e+06 | 7.69e+06 | 0.000 | 6.39e+13 | 6.39e+13 |
| curr_occupation_Working Professional | 9.028e+13 | 7.3e+06 | 1.24e+07 | 0.000 | 9.03e+13 | 9.03e+13 |
| do_not_email_1 | -1.029e+14 | 1.13e+06 | -9.07e+07 | 0.000 | -1.03e+14 | -1.03e+14 |
| last_activity_Converted to Lead | -3.893e+14 | 5.96e+06 | -6.54e+07 | 0.000 | -3.89e+14 | -3.89e+14 |
| last_activity_Email Bounced | -3.002e+14 | 5.25e+06 | -5.72e+07 | 0.000 | -3e+14 | -3e+14 |
| last_activity_Email Link Clicked | -2.886e+14 | 4.78e+06 | -6.04e+07 | 0.000 | -2.89e+14 | -2.89e+14 |
| last_activity_Email Marked Spam | 2.959e+13 | 9.15e+05 | 3.23e+07 | 0.000 | 2.96e+13 | 2.96e+13 |
| last_activity_Email Opened | -7.692e+14 | 1.34e+07 | -5.72e+07 | 0.000 | -7.69e+14 | -7.69e+14 |
| last_activity_Email Received | -1.686e+13 | 9.13e+05 | -1.85e+07 | 0.000 | -1.69e+13 | -1.69e+13 |
| last_activity_Form Submitted on Website | -1.924e+14 | 3.43e+06 | -5.61e+07 | 0.000 | -1.92e+14 | -1.92e+14 |
| last_activity_Had a Phone Conversation | -1.307e+14 | 1.68e+06 | -7.77e+07 | 0.000 | -1.31e+14 | -1.31e+14 |
| last_activity_Olark Chat Conversation | -5.975e+14 | 8.57e+06 | -6.97e+07 | 0.000 | -5.98e+14 | -5.98e+14 |
| last_activity_Page Visited on Website | -4.675e+14 | 7.12e+06 | -6.56e+07 | 0.000 | -4.67e+14 | -4.67e+14 |
| last_activity_Resubscribed to emails | 1.59e+13 | 8.54e+05 | 1.86e+07 | 0.000 | 1.59e+13 | 1.59e+13 |
| last_activity_SMS Sent | -6.373e+14 | 1.27e+07 | -5.04e+07 | 0.000 | -6.37e+14 | -6.37e+14 |
| last_activity_Unreachable | -1.537e+14 | 2.87e+06 | -5.36e+07 | 0.000 | -1.54e+14 | -1.54e+14 |
| last_activity_Unsubscribed | -1.057e+14 | 2.31e+06 | -4.57e+07 | 0.000 | -1.06e+14 | -1.06e+14 |
| last_activity_View in browser link Clicked | -5.007e+13 | 1.12e+06 | -4.47e+07 | 0.000 | -5.01e+13 | -5.01e+13 |
| last_activity_Visited Booth in Tradeshow | -1.672e+13 | 8.62e+05 | -1.94e+07 | 0.000 | -1.67e+13 | -1.67e+13 |
| lead_origin_Landing Page Submission | -4.99e+13 | 1.6e+06 | -3.12e+07 | 0.000 | -4.99e+13 | -4.99e+13 |
| lead_origin_Lead Add Form | 1.512e+14 | 9.63e+06 | 1.57e+07 | 0.000 | 1.51e+14 | 1.51e+14 |
| lead_origin_Lead Import | 2.958e+14 | 4.34e+06 | 6.82e+07 | 0.000 | 2.96e+14 | 2.96e+14 |
| lead_origin_Quick Add Form | -6.8567 | 1.14e-07 | -5.99e+07 | 0.000 | -6.857 | -6.857 |
| lead_source_Direct Traffic | -1.521e+15 | 2.75e+07 | -5.54e+07 | 0.000 | -1.52e+15 | -1.52e+15 |
| lead_source_Facebook | -5.653e+14 | 5.86e+06 | -9.65e+07 | 0.000 | -5.65e+14 | -5.65e+14 |
| lead_source_Google | -1.609e+15 | 2.87e+07 | -5.62e+07 | 0.000 | -1.61e+15 | -1.61e+15 |
| lead_source_Live Chat | -2.153e+13 | 1.12e+06 | -1.93e+07 | 0.000 | -2.15e+13 | -2.15e+13 |
| lead_source_NC_EDM | 1.646e+13 | 1.07e+06 | 1.54e+07 | 0.000 | 1.65e+13 | 1.65e+13 |
| lead_source_Olark Chat | -1.224e+15 | 2.41e+07 | -5.08e+07 | 0.000 | -1.22e+15 | -1.22e+15 |
| lead_source_Organic Search | -1.105e+15 | 2.05e+07 | -5.4e+07 | 0.000 | -1.1e+15 | -1.1e+15 |
| lead_source_Pay per Click Ads | -7.3e+13 | 1.07e+06 | -6.84e+07 | 0.000 | -7.3e+13 | -7.3e+13 |
| lead_source_Press_Release | -9.626e+13 | 1.29e+06 | -7.46e+07 | 0.000 | -9.63e+13 | -9.63e+13 |
| lead_source_Reference | -9.203e+14 | 1.05e+07 | -8.77e+07 | 0.000 | -9.2e+14 | -9.2e+14 |
| lead_source_Referral Sites | -4.156e+14 | 7.33e+06 | -5.67e+07 | 0.000 | -4.16e+14 | -4.16e+14 |
| lead_source_Social Media | -4.731e+13 | 1.29e+06 | -3.67e+07 | 0.000 | -4.73e+13 | -4.73e+13 |
| lead_source_WeLearn | -5.926e+12 | 1.07e+06 | -5.55e+06 | 0.000 | -5.93e+12 | -5.93e+12 |
| lead_source_Welingak Website | -3.289e+14 | 5.63e+06 | -5.84e+07 | 0.000 | -3.29e+14 | -3.29e+14 |
| lead_source_bing | -1.19e+14 | 1.29e+06 | -9.25e+07 | 0.000 | -1.19e+14 | -1.19e+14 |
| lead_source_blog | -8.328e+13 | 1.07e+06 | -7.8e+07 | 0.000 | -8.33e+13 | -8.33e+13 |
| lead_source_google | -1.69e+14 | 1.64e+06 | -1.03e+08 | 0.000 | -1.69e+14 | -1.69e+14 |
| lead_source_testone | -7.475e+13 | 1.07e+06 | -6.99e+07 | 0.000 | -7.48e+13 | -7.48e+13 |
| lead_source_welearnblog_Home | -8.384e+13 | 1.07e+06 | -7.85e+07 | 0.000 | -8.38e+13 | -8.38e+13 |
| lead_source_youtubechannel | -8.404e+13 | 1.07e+06 | -7.83e+07 | 0.000 | -8.4e+13 | -8.4e+13 |
| tags_Already a student | -1.872e+14 | 8.69e+05 | -2.16e+08 | 0.000 | -1.87e+14 | -1.87e+14 |
| tags_Busy | 1.782e+14 | 8.18e+05 | 2.18e+08 | 0.000 | 1.78e+14 | 1.78e+14 |
| tags_Closed by Horizzon | 3.694e+14 | 9.07e+05 | 4.07e+08 | 0.000 | 3.69e+14 | 3.69e+14 |
| tags_Diploma holder (Not Eligible) | -8.49e+13 | 7.97e+05 | -1.07e+08 | 0.000 | -8.49e+13 | -8.49e+13 |
| tags_Graduation in progress | -2.813e+13 | 8.04e+05 | -3.5e+07 | 0.000 | -2.81e+13 | -2.81e+13 |
| tags_In confusion whether part time or DLP | -9.315e+13 | 7.88e+05 | -1.18e+08 | 0.000 | -9.31e+13 | -9.31e+13 |
| tags_Interested in full time MBA | -6.773e+13 | 8.03e+05 | -8.44e+07 | 0.000 | -6.77e+13 | -6.77e+13 |
| tags_Interested in Next batch | 9.869e+13 | 7.9e+05 | 1.25e+08 | 0.000 | 9.87e+13 | 9.87e+13 |
| tags_Interested in other courses | -1.556e+14 | 8.41e+05 | -1.85e+08 | 0.000 | -1.56e+14 | -1.56e+14 |
| tags_Lateral student | 9.589e+13 | 7.88e+05 | 1.22e+08 | 0.000 | 9.59e+13 | 9.59e+13 |
| tags_Lost to EINS | 2.478e+14 | 8.17e+05 | 3.03e+08 | 0.000 | 2.48e+14 | 2.48e+14 |
| tags_Lost to Others | -1.099e+14 | 7.91e+05 | -1.39e+08 | 0.000 | -1.1e+14 | -1.1e+14 |
| tags_Not doing further education | -7.189e+13 | 8.23e+05 | -8.74e+07 | 0.000 | -7.19e+13 | -7.19e+13 |
| tags_Recognition issue (DEC approval) | -4.783e+13 | 8.01e+05 | -5.97e+07 | 0.000 | -4.78e+13 | -4.78e+13 |
| tags_Ringing | -2.414e+14 | 8.93e+05 | -2.7e+08 | 0.000 | -2.41e+14 | -2.41e+14 |
| tags_Shall take in the next coming month | -4.423e+13 | 7.88e+05 | -5.62e+07 | 0.000 | -4.42e+13 | -4.42e+13 |
| tags_Still Thinking | -1.156e+13 | 7.92e+05 | -1.46e+07 | 0.000 | -1.16e+13 | -1.16e+13 |
| tags_University not recognized | -6.546e+13 | 7.9e+05 | -8.28e+07 | 0.000 | -6.55e+13 | -6.55e+13 |
| tags_Want to take admission but has financial problems | 1.814e+13 | 7.96e+05 | 2.28e+07 | 0.000 | 1.81e+13 | 1.81e+13 |
| tags_Will revert after reading the email | 8.643e+14 | 1.09e+06 | 7.94e+08 | 0.000 | 8.64e+14 | 8.64e+14 |
| tags_in touch with EINS | 5.208e+12 | 7.9e+05 | 6.59e+06 | 0.000 | 5.21e+12 | 5.21e+12 |
| tags_invalid number | -6.92e+13 | 7.99e+05 | -8.66e+07 | 0.000 | -6.92e+13 | -6.92e+13 |
| tags_number not provided | -8.165e+13 | 7.93e+05 | -1.03e+08 | 0.000 | -8.16e+13 | -8.16e+13 |
| tags_opp hangup | -7.313e+13 | 7.93e+05 | -9.23e+07 | 0.000 | -7.31e+13 | -7.31e+13 |
| tags_switched off | -1.147e+14 | 8.17e+05 | -1.41e+08 | 0.000 | -1.15e+14 | -1.15e+14 |
| tags_wrong number given | -2.666e+14 | 8.07e+05 | -3.3e+08 | 0.000 | -2.67e+14 | -2.67e+14 |
| specialization_Banking, Investment And Insurance | -4.518e+13 | 9.57e+05 | -4.72e+07 | 0.000 | -4.52e+13 | -4.52e+13 |
| specialization_Business Administration | -1.49e+13 | 9.89e+05 | -1.51e+07 | 0.000 | -1.49e+13 | -1.49e+13 |
| specialization_E-Business | -2.845e+12 | 8.23e+05 | -3.46e+06 | 0.000 | -2.85e+12 | -2.85e+12 |
| specialization_E-COMMERCE | -2.558e+13 | 8.57e+05 | -2.99e+07 | 0.000 | -2.56e+13 | -2.56e+13 |
| specialization_Finance Management | -5.568e+13 | 1.18e+06 | -4.73e+07 | 0.000 | -5.57e+13 | -5.57e+13 |
| specialization_Healthcare Management | -2.026e+13 | 8.77e+05 | -2.31e+07 | 0.000 | -2.03e+13 | -2.03e+13 |
| specialization_Hospitality Management | -2.688e+13 | 8.45e+05 | -3.18e+07 | 0.000 | -2.69e+13 | -2.69e+13 |
| specialization_Human Resource Management | -6.01e+13 | 1.13e+06 | -5.3e+07 | 0.000 | -6.01e+13 | -6.01e+13 |
| specialization_IT Projects Management | -1.731e+13 | 9.86e+05 | -1.76e+07 | 0.000 | -1.73e+13 | -1.73e+13 |
| specialization_International Business | -4.174e+13 | 8.82e+05 | -4.73e+07 | 0.000 | -4.17e+13 | -4.17e+13 |
| specialization_Marketing Management | -2.795e+13 | 1.11e+06 | -2.53e+07 | 0.000 | -2.79e+13 | -2.79e+13 |
| specialization_Media and Advertising | -4.26e+13 | 9.01e+05 | -4.73e+07 | 0.000 | -4.26e+13 | -4.26e+13 |
| specialization_Operations Management | -4.901e+13 | 1.01e+06 | -4.86e+07 | 0.000 | -4.9e+13 | -4.9e+13 |
| specialization_Retail Management | -4.408e+13 | 8.47e+05 | -5.2e+07 | 0.000 | -4.41e+13 | -4.41e+13 |
| specialization_Rural and Agribusiness | -9.661e+12 | 8.34e+05 | -1.16e+07 | 0.000 | -9.66e+12 | -9.66e+12 |
| specialization_Services Excellence | -1.614e+13 | 8.16e+05 | -1.98e+07 | 0.000 | -1.61e+13 | -1.61e+13 |
| specialization_Supply Chain Management | -4.553e+13 | 9.65e+05 | -4.72e+07 | 0.000 | -4.55e+13 | -4.55e+13 |
| specialization_Travel and Tourism | -7.602e+13 | 9.16e+05 | -8.3e+07 | 0.000 | -7.6e+13 | -7.6e+13 |
RFE - Recursive Feature Elimination¶
# Since the dataset has many features, we run RFE to eliminate the insignificant ones
logreg = LogisticRegression()
rfe = RFE(estimator=logreg, n_features_to_select=15) # running RFE with 15 variables as output
rfe = rfe.fit(X_train, y_train)
rfe_feature_Ranking = list(zip(X_train.columns, rfe.support_, rfe.ranking_))
rfe_sorted = sorted(rfe_feature_Ranking, key=lambda x : x[2])
rfe_sorted[:15]
[('ttime_on_site', True, 1),
('last_activity_Olark Chat Conversation', True, 1),
('last_activity_SMS Sent', True, 1),
('lead_origin_Lead Add Form', True, 1),
('lead_source_Olark Chat', True, 1),
('lead_source_Welingak Website', True, 1),
('tags_Already a student', True, 1),
('tags_Closed by Horizzon', True, 1),
('tags_Interested in other courses', True, 1),
('tags_Lost to EINS', True, 1),
('tags_Not doing further education', True, 1),
('tags_Ringing', True, 1),
('tags_Will revert after reading the email', True, 1),
('tags_switched off', True, 1),
('tags_wrong number given', True, 1)]
col = X_train.columns[rfe.support_]
# X_train.columns[~rfe.support_]
Model 1¶
# Now we iterate on the model, dropping one insignificant feature at a time, until we reach an optimal result
cutoff = 0.5
res, y_train_pred,y_train_pred_final = logreg_train_pred_fn(X_train, y_train, col, cutoff)
confusion, accuracy = logreg_metrics_fn(y_train_pred_final)
vif = logreg_VIF_score_fn(X_train, col)
print('Model Summary:') # Model Summary:
res.summary()
print('\nVIF Score:') # VIF Score:
vif
# print('\nY_Predicted Values:') # Y_Predicted Values:
# y_train_pred
# print('\nY_Predicted Cutoff:') # Y_Predicted Cutoff:
# y_train_pred_final
# print('\nConfusion Matrix:') # Confusion Matrix:
# confusion
# print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Model Summary:
| Dep. Variable: | converted | No. Observations: | 7282 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 7266 |
| Model Family: | Binomial | Df Model: | 15 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -1601.1 |
| Date: | Sun, 20 Oct 2024 | Deviance: | 3202.1 |
| Time: | 23:04:22 | Pearson chi2: | 8.54e+03 |
| No. Iterations: | 22 | Pseudo R-squ. (CS): | 0.5892 |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -0.8841 | 60.489 | -0.015 | 0.988 | -119.440 | 117.672 |
| ttime_on_site | 1.0554 | 0.052 | 20.110 | 0.000 | 0.953 | 1.158 |
| last_activity_Olark Chat Conversation | -0.4699 | 0.064 | -7.347 | 0.000 | -0.595 | -0.345 |
| last_activity_SMS Sent | 0.8580 | 0.046 | 18.699 | 0.000 | 0.768 | 0.948 |
| lead_origin_Lead Add Form | 0.4000 | 0.091 | 4.405 | 0.000 | 0.222 | 0.578 |
| lead_source_Olark Chat | 0.5871 | 0.051 | 11.412 | 0.000 | 0.486 | 0.688 |
| lead_source_Welingak Website | 0.5735 | 0.126 | 4.555 | 0.000 | 0.327 | 0.820 |
| tags_Already a student | -0.7594 | 0.156 | -4.853 | 0.000 | -1.066 | -0.453 |
| tags_Closed by Horizzon | 1.2012 | 0.136 | 8.801 | 0.000 | 0.934 | 1.469 |
| tags_Interested in other courses | -0.4510 | 0.074 | -6.087 | 0.000 | -0.596 | -0.306 |
| tags_Lost to EINS | 0.6964 | 0.071 | 9.754 | 0.000 | 0.556 | 0.836 |
| tags_Not doing further education | -0.3755 | 0.126 | -2.972 | 0.003 | -0.623 | -0.128 |
| tags_Ringing | -0.9581 | 0.070 | -13.702 | 0.000 | -1.095 | -0.821 |
| tags_Will revert after reading the email | 1.9035 | 0.069 | 27.620 | 0.000 | 1.768 | 2.039 |
| tags_switched off | -0.4885 | 0.083 | -5.851 | 0.000 | -0.652 | -0.325 |
| tags_wrong number given | -1.6314 | 835.166 | -0.002 | 0.998 | -1638.527 | 1635.264 |
VIF Score:
| Features | VIF | |
|---|---|---|
| 3 | lead_origin_Lead Add Form | 1.670 |
| 4 | lead_source_Olark Chat | 1.450 |
| 12 | tags_Will revert after reading the email | 1.450 |
| 0 | ttime_on_site | 1.420 |
| 5 | lead_source_Welingak Website | 1.310 |
| 1 | last_activity_Olark Chat Conversation | 1.280 |
| 7 | tags_Closed by Horizzon | 1.220 |
| 11 | tags_Ringing | 1.180 |
| 2 | last_activity_SMS Sent | 1.170 |
| 8 | tags_Interested in other courses | 1.090 |
| 6 | tags_Already a student | 1.070 |
| 13 | tags_switched off | 1.050 |
| 9 | tags_Lost to EINS | 1.040 |
| 10 | tags_Not doing further education | 1.030 |
| 14 | tags_wrong number given | 1.010 |
Model 2¶
# col = col.drop('curr_occupation_Housewife')
# col
col = col.drop('tags_wrong number given')  # p-value 0.998 in Model 1
col
cutoff = 0.5
res, y_train_pred,y_train_pred_final = logreg_train_pred_fn(X_train, y_train, col, cutoff)
confusion, accuracy = logreg_metrics_fn(y_train_pred_final)
vif = logreg_VIF_score_fn(X_train, col)
print('Model Summary:') # Model Summary:
res.summary()
print('\nVIF Score:') # VIF Score:
vif
# print('\nY_Predicted Values:') # Y_Predicted Values:
# y_train_pred
# print('\nY_Predicted Cutoff:') # Y_Predicted Cutoff:
# y_train_pred_final
print('\nConfusion Matrix:') # Confusion Matrix:
confusion
print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Index(['ttime_on_site', 'last_activity_Olark Chat Conversation',
'last_activity_SMS Sent', 'lead_origin_Lead Add Form',
'lead_source_Olark Chat', 'lead_source_Welingak Website',
'tags_Already a student', 'tags_Closed by Horizzon',
'tags_Interested in other courses', 'tags_Lost to EINS',
'tags_Not doing further education', 'tags_Ringing',
'tags_Will revert after reading the email', 'tags_switched off'],
dtype='object')
Model Summary:
| Dep. Variable: | converted | No. Observations: | 7282 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 7267 |
| Model Family: | Binomial | Df Model: | 14 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -1613.0 |
| Date: | Sun, 20 Oct 2024 | Deviance: | 3225.9 |
| Time: | 23:04:22 | Pearson chi2: | 8.54e+03 |
| No. Iterations: | 8 | Pseudo R-squ. (CS): | 0.5878 |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -0.7748 | 0.074 | -10.459 | 0.000 | -0.920 | -0.630 |
| ttime_on_site | 1.0570 | 0.052 | 20.244 | 0.000 | 0.955 | 1.159 |
| last_activity_Olark Chat Conversation | -0.4698 | 0.064 | -7.339 | 0.000 | -0.595 | -0.344 |
| last_activity_SMS Sent | 0.8479 | 0.046 | 18.582 | 0.000 | 0.758 | 0.937 |
| lead_origin_Lead Add Form | 0.4040 | 0.091 | 4.445 | 0.000 | 0.226 | 0.582 |
| lead_source_Olark Chat | 0.5964 | 0.051 | 11.614 | 0.000 | 0.496 | 0.697 |
| lead_source_Welingak Website | 0.5749 | 0.126 | 4.567 | 0.000 | 0.328 | 0.822 |
| tags_Already a student | -0.7561 | 0.156 | -4.832 | 0.000 | -1.063 | -0.449 |
| tags_Closed by Horizzon | 1.2040 | 0.136 | 8.823 | 0.000 | 0.937 | 1.472 |
| tags_Interested in other courses | -0.4463 | 0.074 | -6.029 | 0.000 | -0.591 | -0.301 |
| tags_Lost to EINS | 0.6987 | 0.071 | 9.785 | 0.000 | 0.559 | 0.839 |
| tags_Not doing further education | -0.3729 | 0.126 | -2.951 | 0.003 | -0.621 | -0.125 |
| tags_Ringing | -0.9469 | 0.070 | -13.565 | 0.000 | -1.084 | -0.810 |
| tags_Will revert after reading the email | 1.9121 | 0.069 | 27.758 | 0.000 | 1.777 | 2.047 |
| tags_switched off | -0.4829 | 0.083 | -5.786 | 0.000 | -0.647 | -0.319 |
VIF Score:
| Features | VIF | |
|---|---|---|
| 3 | lead_origin_Lead Add Form | 1.670 |
| 4 | lead_source_Olark Chat | 1.450 |
| 12 | tags_Will revert after reading the email | 1.450 |
| 0 | ttime_on_site | 1.420 |
| 5 | lead_source_Welingak Website | 1.310 |
| 1 | last_activity_Olark Chat Conversation | 1.280 |
| 7 | tags_Closed by Horizzon | 1.220 |
| 2 | last_activity_SMS Sent | 1.170 |
| 11 | tags_Ringing | 1.170 |
| 8 | tags_Interested in other courses | 1.090 |
| 6 | tags_Already a student | 1.070 |
| 13 | tags_switched off | 1.050 |
| 9 | tags_Lost to EINS | 1.040 |
| 10 | tags_Not doing further education | 1.030 |
Confusion Matrix:
array([[4302, 204],
[ 389, 2387]])
Accuracy Score: 0.9185663279318869
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Calculate the sensitivity
TP/(TP+FN)
# Calculate the specificity
TN/(TN+FP)
0.8598703170028819
0.9547270306258322
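The TP/TN/FP/FN unpacking above is repeated after every model, so it can be worth folding into a small helper (the function name here is hypothetical, not part of the notebook):

```python
from sklearn.metrics import confusion_matrix

def sens_spec(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels."""
    # sklearn's confusion matrix ravels as tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: 3 positives, 3 negatives, one error of each kind
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(sens_spec(y_true, y_pred))  # (0.666..., 0.666...)
```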
Finding Optimal Cutoff Point¶
# Let's create columns with different probability cutoffs
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
y_train_pred_final[i]= y_train_pred_final.Conv_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()
| Converted | Conv_Prob | ID | predicted | 0.000 | 0.100 | 0.200 | 0.300 | 0.400 | 0.500 | 0.600 | 0.700 | 0.800 | 0.900 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.002 | 7740 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.966 | 7014 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0.004 | 2431 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.983 | 3012 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 0 | 0.165 | 3140 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Now let's calculate accuracy, sensitivity, and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
cm1 = confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
total1=sum(sum(cm1))
accuracy = (cm1[0,0]+cm1[1,1])/total1
speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)
| prob | accuracy | sensi | speci | |
|---|---|---|---|---|
| 0.000 | 0.000 | 0.381 | 1.000 | 0.000 |
| 0.100 | 0.100 | 0.832 | 0.970 | 0.747 |
| 0.200 | 0.200 | 0.884 | 0.930 | 0.856 |
| 0.300 | 0.300 | 0.904 | 0.900 | 0.906 |
| 0.400 | 0.400 | 0.916 | 0.875 | 0.942 |
| 0.500 | 0.500 | 0.919 | 0.860 | 0.955 |
| 0.600 | 0.600 | 0.907 | 0.807 | 0.968 |
| 0.700 | 0.700 | 0.902 | 0.785 | 0.975 |
| 0.800 | 0.800 | 0.896 | 0.755 | 0.982 |
| 0.900 | 0.900 | 0.874 | 0.686 | 0.990 |
# Let's plot accuracy, sensitivity, and specificity against the cutoff probability.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()
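The 0.29 cutoff used below is read off the plot, near where the sensitivity and specificity curves cross. The same point can be picked numerically from the cutoff table; a sketch using a subset of the values printed above (the `cutoff_tbl` name is illustrative to avoid clobbering the notebook's `cutoff_df`):

```python
import pandas as pd

# Illustrative subset of the cutoff table computed above
cutoff_tbl = pd.DataFrame({
    'prob':  [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    'sensi': [1.000, 0.970, 0.930, 0.900, 0.875, 0.860],
    'speci': [0.000, 0.747, 0.856, 0.906, 0.942, 0.955],
})

# Balanced cutoff: the probability where |sensitivity - specificity| is smallest
gap = (cutoff_tbl['sensi'] - cutoff_tbl['speci']).abs()
best_cutoff = cutoff_tbl.loc[gap.idxmin(), 'prob']
print(best_cutoff)  # 0.3 on this coarse grid; a finer grid lands near 0.29
```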
# From the plot, the sensitivity and specificity curves intersect near 0.3, so we take 0.29 as the cutoff
y_train_pred_final['final_predicted'] = y_train_pred_final.Conv_Prob.map( lambda x: 1 if x > 0.29 else 0)
y_train_pred_final.head()
| Converted | Conv_Prob | ID | predicted | 0.000 | 0.100 | 0.200 | 0.300 | 0.400 | 0.500 | 0.600 | 0.700 | 0.800 | 0.900 | final_predicted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.002 | 7740 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.966 | 7014 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0.004 | 2431 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.983 | 3012 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 0 | 0.165 | 3140 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Let's check the overall accuracy.
accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
0.9026366382861851
confusion = confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion
array([[4069, 437],
[ 272, 2504]])
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Calculate the sensitivity
TP/(TP+FN)
# Calculate the specificity
TN/(TN+FP)
precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
0.9020172910662824
0.9030181979582779
0.9212659204940178
0.8598703170028819
----------------------------------------------------------------------¶
ROC Curve and Precision - Recall Curve¶
def draw_roc( actual, probs ):
fpr, tpr, thresholds = roc_curve( actual, probs, drop_intermediate = False )
auc_score = roc_auc_score( actual, probs )
plt.figure(figsize=(5, 5))
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
return None
fpr, tpr, thresholds = roc_curve( y_train_pred_final.Converted, y_train_pred_final.Conv_Prob, drop_intermediate = False )
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Conv_Prob)
precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
0.9212659204940178
0.8598703170028819
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Conv_Prob)
plt.plot(thresholds, p[:-1], "b")
plt.plot(thresholds, r[:-1], "r")
plt.title('Precision Recall Curve')
plt.show();
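The crossover of the two curves suggests the cutoff used in the next cell (0.39). A hedged sketch of extracting that threshold numerically with `precision_recall_curve`, run here on synthetic scores since the notebook's probabilities are not reproduced (variable names chosen to avoid shadowing `p`, `r`, `thresholds` above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)
# Synthetic probabilities: positives score higher than negatives on average
labels = np.r_[np.zeros(300, dtype=int), np.ones(200, dtype=int)]
scores = np.r_[rng.beta(2, 5, 300), rng.beta(5, 2, 200)]

prec, rec, thr = precision_recall_curve(labels, scores)
# prec and rec have one more entry than thr; drop the final point to align
idx = np.argmin(np.abs(prec[:-1] - rec[:-1]))
crossover = thr[idx]
print(float(crossover))
```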
# Re-derive the final predictions at 0.39, the cutoff near the precision-recall crossover above
y_train_pred_final['final_predicted'] = y_train_pred_final.Conv_Prob.map( lambda x: 1 if x > 0.39 else 0)
y_train_pred_final.head()
| Converted | Conv_Prob | ID | predicted | 0.000 | 0.100 | 0.200 | 0.300 | 0.400 | 0.500 | 0.600 | 0.700 | 0.800 | 0.900 | final_predicted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.002 | 7740 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.966 | 7014 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0.004 | 2431 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.983 | 3012 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 0 | 0.165 | 3140 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Let's check the overall accuracy.
accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
0.9147212304312002
confusion = confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion
array([[4229, 277],
[ 344, 2432]])
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Calculate the sensitivity
TP/(TP+FN)
# Calculate the specificity
TN/(TN+FP)
precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
0.8760806916426513
0.9385264092321349
0.9212659204940178
0.8598703170028819
---------------------------------------------------------------¶
Predictions on the test set¶
Custom Functions for Test¶
def logreg_test_pred_fn(fX_test, fy_test, fcol, fcutoff, fres):
fX_test_sm = sm.add_constant(fX_test[fcol])
fy_test_pred = fres.predict(fX_test_sm)
fy_test_pred = fy_test_pred.values.reshape(-1)
fy_test_pred_final = pd.DataFrame({'Converted':fy_test.values, 'Conv_Prob':fy_test_pred})
fy_test_pred_final['ID'] = fy_test.index
fy_test_pred_final['predicted'] = fy_test_pred_final.Conv_Prob.map(lambda x: 1 if x > fcutoff else 0)
return fres, fy_test_pred,fy_test_pred_final
def logreg_test_metrics_fn(fy_test_pred_final):
fconfusion = confusion_matrix(fy_test_pred_final.Converted, fy_test_pred_final.predicted )
faccuracy = accuracy_score(fy_test_pred_final.Converted, fy_test_pred_final.predicted)
return fconfusion, faccuracy
def logreg_test_VIF_score_fn(fX_test, fcol):
fvif = pd.DataFrame()
fvif['Features'] = fX_test[fcol].columns
fvif['VIF'] = [variance_inflation_factor(fX_test[fcol].values, i) for i in range(fX_test[fcol].shape[1])]
fvif['VIF'] = round(fvif['VIF'], 2)
fvif = fvif.sort_values(by = "VIF", ascending = False)
return fvif
Model Validation on Test Data¶
X_test[to_scale] = scaler.transform(X_test[to_scale])
X_test[col].head(2)
X_test.shape
| ttime_on_site | last_activity_Olark Chat Conversation | last_activity_SMS Sent | lead_origin_Lead Add Form | lead_source_Olark Chat | lead_source_Welingak Website | tags_Already a student | tags_Closed by Horizzon | tags_Interested in other courses | tags_Lost to EINS | tags_Not doing further education | tags_Ringing | tags_Will revert after reading the email | tags_switched off | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7421 | -0.502 | -0.345 | -0.647 | -0.265 | -0.485 | -0.119 | -0.229 | 5.116 | -0.245 | -0.138 | -0.127 | -0.389 | -0.532 | -0.164 |
| 4570 | -0.409 | -0.345 | 1.545 | -0.265 | -0.485 | -0.119 | -0.229 | -0.195 | -0.245 | -0.138 | -0.127 | -0.389 | -0.532 | -0.164 |
(1821, 99)
cutoff = 0.39
res, y_test_pred, y_test_pred_final = logreg_test_pred_fn(X_test, y_test, col, cutoff, res)
confusion, accuracy = logreg_test_metrics_fn(y_test_pred_final)
vif = logreg_test_VIF_score_fn(X_test, col)
# print('Model Summary:') # Model Summary:
# res.summary()
# print('\nVIF Score:') # VIF Score:
# vif
# print('\nY_Predicted Values:') # Y_Predicted Values:
# y_test_pred
# print('\nY_Predicted Cutoff:') # Y_Predicted Cutoff:
# y_test_pred_final
print('\nConfusion Matrix:') # Confusion Matrix:
confusion
print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Confusion Matrix:
array([[1080, 65],
[ 61, 615]])
Accuracy Score: 0.9308072487644151
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Calculate the sensitivity
TP/(TP+FN)
# Calculate the specificity
TN/(TN+FP)
precision_score(y_test_pred_final.Converted, y_test_pred_final.predicted)
recall_score(y_test_pred_final.Converted, y_test_pred_final.predicted)
0.9097633136094675
0.9432314410480349
0.9044117647058824
0.9097633136094675
Approach - 01 - Accuracy Score:¶
print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Accuracy Score: 0.9308072487644151
-----------------------------------------------------¶
# errorline # do not remove this line
-----------------------------------------------------¶
Approach - 02¶
Data Encoding¶
Label Encoding¶
# for label_encoding:
# Since there are many categorical features with nominal values, we use label encoding for those features
cols_to_le = lead_score_df.select_dtypes('category').columns
cols_to_le
Index(['lead_origin', 'lead_source', 'do_not_email', 'last_activity',
'specialization', 'curr_occupation', 'tags', 'city', 'avail_free_copy'],
dtype='object')
df_le = lead_score_df.copy()
le = LabelEncoder()
df_le[cols_to_le] = df_le[cols_to_le].apply(le.fit_transform)
plt.figure(figsize=(10, 6))
sns.heatmap(df_le.corr(), annot=True)
# We could then merge the encoded features back with the non-encoded features into a new DataFrame (new_ls_df)
# new_ls_df = lead_score_df[lead_score_df.columns.difference(cols_to_le)]
# new_ls_df = new_ls_df.merge(right=df_le, right_index=True, left_index=True)
df_le.head(3)
| lead_origin | lead_source | do_not_email | converted | totalvisits | ttime_on_site | pg_view_pv | last_activity | specialization | curr_occupation | tags | city | avail_free_copy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 0 | 0 | 0.000 | 0.000 | 0.000 | 10 | 18 | 4 | 8 | 0 | 0 |
| 1 | 0 | 7 | 0 | 0 | 5.000 | 674.000 | 2.500 | 5 | 18 | 4 | 14 | 0 | 0 |
| 2 | 1 | 1 | 0 | 1 | 2.000 | 1532.000 | 2.000 | 5 | 1 | 3 | 19 | 0 | 1 |
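One detail worth noting: `df_le[cols_to_le].apply(le.fit_transform)` refits the single LabelEncoder on each column in turn, so every column gets its own independent integer mapping, with classes assigned codes in sorted order. A minimal illustration with made-up values (`demo_` names are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

demo_df = pd.DataFrame({
    'lead_origin': ['API', 'Landing Page', 'API'],
    'city': ['Mumbai', 'Chennai', 'Mumbai'],
})

demo_le = LabelEncoder()
# fit_transform is re-run per column, so integer codes restart at 0 for each column
encoded = demo_df.apply(demo_le.fit_transform)
print(encoded['lead_origin'].tolist())  # [0, 1, 0]  ('API' < 'Landing Page')
print(encoded['city'].tolist())         # [1, 0, 1]  ('Chennai' < 'Mumbai')
```

This is fine for tree models, but for a linear model like logistic regression the resulting integer codes impose an arbitrary ordering on nominal categories, which is one reason Approach 01 used dummy variables instead.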
# After the earlier cleaning, no columns report missing values
null_pct = check_cols_null_pct(df_le)
null_pct[null_pct>0]
df_le.shape
Series([], dtype: float64)
(9103, 13)
# new_ls_df = new_ls_df[(new_ls_df.notna()).all(axis=1)]
# check_cols_null_pct(new_ls_df)
# sorted([f'{i} - {new_ls_df[i].unique()}' for i in new_ls_df.columns])
----------------------------------------------------------------------¶
Train and Test Split¶
X = df_le.drop(['converted'], axis=1)
y = df_le['converted']
# Now we split the dataset into train and test set
np.random.seed(0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
----------------------------------------------------------------------¶
Feature Scaling¶
# After the split, we apply standard scaling: fit and transform on the training set (the test set is later transformed with the same fitted scaler)
# to_scale = ['lead_number', 'totalvisits', 'ttime_on_site', 'pg_view_pv']
to_scale = ['totalvisits', 'ttime_on_site', 'pg_view_pv', 'lead_source', 'last_activity', 'curr_occupation']
scaler = StandardScaler()
X_train[to_scale] = scaler.fit_transform(X_train[to_scale])
X_train.head()
| lead_origin | lead_source | do_not_email | totalvisits | ttime_on_site | pg_view_pv | last_activity | specialization | curr_occupation | tags | city | avail_free_copy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7740 | 1 | -1.063 | 0 | -0.779 | -0.376 | -0.712 | -1.823 | 7 | -0.122 | 0 | 2 | 1 |
| 7014 | 1 | -0.392 | 0 | 0.971 | 0.881 | -0.150 | -0.729 | 4 | -0.122 | 19 | 0 | 0 |
| 2431 | 1 | -0.392 | 0 | -0.429 | -0.337 | -0.150 | -0.182 | 8 | -0.122 | 12 | 0 | 0 |
| 3012 | 0 | 0.949 | 0 | 1.672 | 1.539 | 0.973 | -0.729 | 5 | -0.122 | 19 | 0 | 0 |
| 3140 | 0 | 0.614 | 0 | -1.130 | -0.889 | -1.274 | -0.729 | 18 | -0.122 | 25 | 0 | 0 |
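The scaler fitted here is reused on the test set with `transform` only (as in Approach 01's `X_test[to_scale] = scaler.transform(...)`), so test data is standardized with the training mean and standard deviation rather than its own. A minimal sketch of the pattern (`demo_scaler` and the toy arrays are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # statistics learned from train only
test_scaled = demo_scaler.transform(test)        # train statistics reused; no refit

print(demo_scaler.mean_)      # [2.]
print(test_scaled.ravel())    # approximately [0., 2.449]
```

Fitting on train only avoids leaking test-set statistics into the model, which would make evaluation optimistic.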
----------------------------------------------------------------------¶
Model Building¶
Custom Functions for Model Training¶
# We define reusable helper functions for model building, since each model iteration repeats the same steps
# The train-and-predict function fits the model, predicts on the training data, and returns the fitted result, the predicted probabilities, and the cutoff-based predictions
# The metrics function returns the confusion matrix and accuracy score
# The VIF function returns the VIF score for the features
def logreg_train_pred_fn(fX_train, fy_train, fcol, fcutoff):
fX_train_sm = sm.add_constant(fX_train[fcol])
flogm = sm.GLM(fy_train, fX_train_sm, family = sm.families.Binomial())
fres = flogm.fit()
fy_train_pred = fres.predict(fX_train_sm)
fy_train_pred = fy_train_pred.values.reshape(-1)
fy_train_pred_final = pd.DataFrame({'Converted':fy_train.values, 'Conv_Prob':fy_train_pred})
fy_train_pred_final['ID'] = fy_train.index
fy_train_pred_final['predicted'] = fy_train_pred_final.Conv_Prob.map(lambda x: 1 if x > fcutoff else 0)
return fres, fy_train_pred,fy_train_pred_final
def logreg_metrics_fn(fy_train_pred_final):
fconfusion = confusion_matrix(fy_train_pred_final.Converted, fy_train_pred_final.predicted )
faccuracy = accuracy_score(fy_train_pred_final.Converted, fy_train_pred_final.predicted)
return fconfusion, faccuracy
def logreg_VIF_score_fn(fX_train, fcol):
fvif = pd.DataFrame()
fvif['Features'] = fX_train[fcol].columns
fvif['VIF'] = [variance_inflation_factor(fX_train[fcol].values, i) for i in range(fX_train[fcol].shape[1])]
fvif['VIF'] = round(fvif['VIF'], 2)
fvif = fvif.sort_values(by = "VIF", ascending = False)
return fvif
Base Model 1¶
# Logistic regression model
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
res = logm1.fit()
# res.summary()
RFE - Recursive Feature Elimination¶
# Since the dataset has many features, we run RFE to eliminate the insignificant ones
logreg = LogisticRegression()
rfe = RFE(estimator=logreg, n_features_to_select=15) # request up to 15 features; only 12 remain here, so all are ranked 1
rfe = rfe.fit(X_train, y_train)
rfe_feature_Ranking = list(zip(X_train.columns, rfe.support_, rfe.ranking_))
rfe_sorted = sorted(rfe_feature_Ranking, key=lambda x : x[2])
rfe_sorted[:15]
[('lead_origin', True, 1),
('lead_source', True, 1),
('do_not_email', True, 1),
('totalvisits', True, 1),
('ttime_on_site', True, 1),
('pg_view_pv', True, 1),
('last_activity', True, 1),
('specialization', True, 1),
('curr_occupation', True, 1),
('tags', True, 1),
('city', True, 1),
('avail_free_copy', True, 1)]
col = X_train.columns[rfe.support_]
# X_train.columns[~rfe.support_]
Model 1¶
# Now we iterate on the model, dropping one insignificant feature at a time, until we reach an optimal result
cutoff = 0.5
res, y_train_pred,y_train_pred_final = logreg_train_pred_fn(X_train, y_train, col, cutoff)
confusion, accuracy = logreg_metrics_fn(y_train_pred_final)
vif = logreg_VIF_score_fn(X_train, col)
print('Model Summary:') # Model Summary:
res.summary()
print('\nVIF Score:') # VIF Score:
vif
# print('\nY_Predicted Values:') # Y_Predicted Values:
# y_train_pred
# print('\nY_Predicted Cutoff:') # Y_Predicted Cutoff:
# y_train_pred_final
# print('\nConfusion Matrix:') # Confusion Matrix:
# confusion
# print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Model Summary:
| Dep. Variable: | converted | No. Observations: | 7282 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 7269 |
| Model Family: | Binomial | Df Model: | 12 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -3558.5 |
| Date: | Sun, 20 Oct 2024 | Deviance: | 7117.0 |
| Time: | 23:04:24 | Pearson chi2: | 9.24e+03 |
| No. Iterations: | 5 | Pseudo R-squ. (CS): | 0.2967 |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -0.5614 | 0.121 | -4.656 | 0.000 | -0.798 | -0.325 |
| lead_origin | 0.6757 | 0.058 | 11.667 | 0.000 | 0.562 | 0.789 |
| lead_source | 0.5132 | 0.035 | 14.595 | 0.000 | 0.444 | 0.582 |
| do_not_email | -1.4810 | 0.145 | -10.216 | 0.000 | -1.765 | -1.197 |
| totalvisits | 0.1084 | 0.044 | 2.443 | 0.015 | 0.021 | 0.195 |
| ttime_on_site | 1.0413 | 0.034 | 30.765 | 0.000 | 0.975 | 1.108 |
| pg_view_pv | -0.4758 | 0.047 | -10.069 | 0.000 | -0.568 | -0.383 |
| last_activity | 0.5183 | 0.031 | 16.677 | 0.000 | 0.457 | 0.579 |
| specialization | -0.0415 | 0.006 | -6.841 | 0.000 | -0.053 | -0.030 |
| curr_occupation | 0.4308 | 0.036 | 11.966 | 0.000 | 0.360 | 0.501 |
| tags | 0.0096 | 0.004 | 2.512 | 0.012 | 0.002 | 0.017 |
| city | 0.0398 | 0.024 | 1.687 | 0.092 | -0.006 | 0.086 |
| avail_free_copy | -0.4397 | 0.073 | -5.994 | 0.000 | -0.583 | -0.296 |
VIF Score:
| Features | VIF | |
|---|---|---|
| 9 | tags | 4.170 |
| 7 | specialization | 3.500 |
| 5 | pg_view_pv | 2.540 |
| 3 | totalvisits | 2.410 |
| 0 | lead_origin | 2.320 |
| 11 | avail_free_copy | 1.930 |
| 10 | city | 1.380 |
| 1 | lead_source | 1.330 |
| 4 | ttime_on_site | 1.220 |
| 2 | do_not_email | 1.100 |
| 6 | last_activity | 1.040 |
| 8 | curr_occupation | 1.040 |
Model 2¶
# col = col.drop('avail_free_copy')
# col
col = col.drop('city')  # p-value 0.092 in Model 1
col
cutoff = 0.5
res, y_train_pred,y_train_pred_final = logreg_train_pred_fn(X_train, y_train, col, cutoff)
confusion, accuracy = logreg_metrics_fn(y_train_pred_final)
vif = logreg_VIF_score_fn(X_train, col)
print('Model Summary:') # Model Summary:
res.summary()
print('\nVIF Score:') # VIF Score:
vif
# print('\nY_Predicted Values:') # Y_Predicted Values:
# y_train_pred
# print('\nY_Predicted Cutoff:') # Y_Predicted Cutoff:
# y_train_pred_final
print('\nConfusion Matrix:') # Confusion Matrix:
confusion
print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Index(['lead_origin', 'lead_source', 'do_not_email', 'totalvisits',
'ttime_on_site', 'pg_view_pv', 'last_activity', 'specialization',
'curr_occupation', 'tags', 'avail_free_copy'],
dtype='object')
Model Summary:
| Dep. Variable: | converted | No. Observations: | 7282 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 7270 |
| Model Family: | Binomial | Df Model: | 11 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -3559.9 |
| Date: | Sun, 20 Oct 2024 | Deviance: | 7119.9 |
| Time: | 23:04:24 | Pearson chi2: | 9.27e+03 |
| No. Iterations: | 5 | Pseudo R-squ. (CS): | 0.2964 |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -0.5380 | 0.120 | -4.492 | 0.000 | -0.773 | -0.303 |
| lead_origin | 0.6887 | 0.057 | 11.986 | 0.000 | 0.576 | 0.801 |
| lead_source | 0.5109 | 0.035 | 14.541 | 0.000 | 0.442 | 0.580 |
| do_not_email | -1.4786 | 0.145 | -10.196 | 0.000 | -1.763 | -1.194 |
| totalvisits | 0.1102 | 0.044 | 2.484 | 0.013 | 0.023 | 0.197 |
| ttime_on_site | 1.0427 | 0.034 | 30.828 | 0.000 | 0.976 | 1.109 |
| pg_view_pv | -0.4724 | 0.047 | -10.014 | 0.000 | -0.565 | -0.380 |
| last_activity | 0.5156 | 0.031 | 16.622 | 0.000 | 0.455 | 0.576 |
| specialization | -0.0423 | 0.006 | -6.991 | 0.000 | -0.054 | -0.030 |
| curr_occupation | 0.4307 | 0.036 | 11.954 | 0.000 | 0.360 | 0.501 |
| tags | 0.0096 | 0.004 | 2.530 | 0.011 | 0.002 | 0.017 |
| avail_free_copy | -0.4317 | 0.073 | -5.899 | 0.000 | -0.575 | -0.288 |
VIF Score:
| Features | VIF | |
|---|---|---|
| 9 | tags | 4.140 |
| 7 | specialization | 3.500 |
| 5 | pg_view_pv | 2.540 |
| 3 | totalvisits | 2.400 |
| 0 | lead_origin | 2.200 |
| 10 | avail_free_copy | 1.910 |
| 1 | lead_source | 1.320 |
| 4 | ttime_on_site | 1.220 |
| 2 | do_not_email | 1.100 |
| 6 | last_activity | 1.040 |
| 8 | curr_occupation | 1.040 |
Confusion Matrix:
array([[3941, 565],
[1047, 1729]])
Accuracy Score: 0.7786322438890415
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Calculate the sensitivity
TP/(TP+FN)
# Calculate the specificity
TN/(TN+FP)
0.6228386167146974
0.8746116289391922
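All three train metrics follow directly from the confusion matrix; a quick NumPy cross-check (matrix copied from the Model 2 output above):

```python
import numpy as np

# Model 2 train confusion matrix (rows: actual 0/1, cols: predicted 0/1).
confusion = np.array([[3941, 565],
                      [1047, 1729]])
TN, FP, FN, TP = confusion.ravel()

accuracy = (TP + TN) / confusion.sum()   # ~0.7786, as printed above
sensitivity = TP / (TP + FN)             # ~0.6228
specificity = TN / (TN + FP)             # ~0.8746
print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```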
----------------------------------------------------------------------¶
Finding Optimal Cutoff Point¶
# Let's create columns with different probability cutoffs
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
y_train_pred_final[i]= y_train_pred_final.Conv_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()
| Converted | Conv_Prob | ID | predicted | 0.000 | 0.100 | 0.200 | 0.300 | 0.400 | 0.500 | 0.600 | 0.700 | 0.800 | 0.900 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.095 | 7740 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.653 | 7014 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0.322 | 2431 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.694 | 3012 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0.165 | 3140 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Now let's calculate accuracy, sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame(columns=['prob', 'accuracy', 'sensi', 'speci'])
num = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for i in num:
    cm1 = confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i])
    total1 = cm1.sum()
    accuracy = (cm1[0, 0] + cm1[1, 1]) / total1
    speci = cm1[0, 0] / (cm1[0, 0] + cm1[0, 1])
    sensi = cm1[1, 1] / (cm1[1, 0] + cm1[1, 1])
    cutoff_df.loc[i] = [i, accuracy, sensi, speci]
print(cutoff_df)
| prob | accuracy | sensi | speci | |
|---|---|---|---|---|
| 0.000 | 0.000 | 0.381 | 1.000 | 0.000 |
| 0.100 | 0.100 | 0.492 | 0.974 | 0.194 |
| 0.200 | 0.200 | 0.642 | 0.885 | 0.492 |
| 0.300 | 0.300 | 0.762 | 0.818 | 0.727 |
| 0.400 | 0.400 | 0.779 | 0.714 | 0.820 |
| 0.500 | 0.500 | 0.779 | 0.623 | 0.875 |
| 0.600 | 0.600 | 0.766 | 0.525 | 0.914 |
| 0.700 | 0.700 | 0.740 | 0.408 | 0.945 |
| 0.800 | 0.800 | 0.700 | 0.261 | 0.971 |
| 0.900 | 0.900 | 0.652 | 0.100 | 0.993 |
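The table shows sensitivity and specificity crossing between 0.3 and 0.4. The balance point can also be located programmatically; a sketch using the table's own values (on this coarse 0.1-step grid the answer is 0.3, and a finer grid is what places it near 0.36):

```python
import pandas as pd

# Rebuild the cutoff table printed above.
cutoff_df = pd.DataFrame({
    'prob':     [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    'accuracy': [0.381, 0.492, 0.642, 0.762, 0.779, 0.779, 0.766, 0.740, 0.700, 0.652],
    'sensi':    [1.000, 0.974, 0.885, 0.818, 0.714, 0.623, 0.525, 0.408, 0.261, 0.100],
    'speci':    [0.000, 0.194, 0.492, 0.727, 0.820, 0.875, 0.914, 0.945, 0.971, 0.993],
})

# Cutoff where sensitivity and specificity are closest to each other.
balanced = cutoff_df.loc[(cutoff_df.sensi - cutoff_df.speci).abs().idxmin(), 'prob']
print(balanced)  # 0.3 on this grid
```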
# Let's plot accuracy, sensitivity and specificity against the probability cutoff.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()
# From the plot above, the three curves intersect around 0.36; take that as a balanced cutoff
y_train_pred_final['final_predicted'] = y_train_pred_final.Conv_Prob.map( lambda x: 1 if x > 0.36 else 0)
y_train_pred_final.head()
| Converted | Conv_Prob | ID | predicted | 0.000 | 0.100 | 0.200 | 0.300 | 0.400 | 0.500 | 0.600 | 0.700 | 0.800 | 0.900 | final_predicted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.095 | 7740 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.653 | 7014 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0.322 | 2431 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.694 | 3012 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0.165 | 3140 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Let's check the overall accuracy.
accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
0.7728645976380115
confusion = confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion
array([[3564, 942],
[ 712, 2064]])
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Calculate the sensitivity
TP/(TP+FN)
# Calculate the specificity
TN/(TN+FP)
0.7435158501440923
0.7909454061251664
----------------------------------------------------------------------¶
ROC Curve and Precision - Recall Curve¶
def draw_roc( actual, probs ):
fpr, tpr, thresholds = roc_curve( actual, probs, drop_intermediate = False )
auc_score = roc_auc_score( actual, probs )
plt.figure(figsize=(5, 5))
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
return None
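The AUC that `draw_roc` reports can also be read as the probability that a randomly chosen converted lead receives a higher score than a randomly chosen non-converted one. A toy check of that rank interpretation (labels and scores invented, not from the model):

```python
import numpy as np

# Classic 4-point example: two negatives, two positives.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# Rank-based AUC: fraction of positive/negative pairs where the
# positive outscores the negative (ties count half).
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
auc = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])
print(auc)  # 0.75
```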
fpr, tpr, thresholds = roc_curve( y_train_pred_final.Converted, y_train_pred_final.Conv_Prob, drop_intermediate = False )
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Conv_Prob)
precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
0.7537053182214473
0.6228386167146974
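These two numbers can be cross-checked against Model 2's 0.5-cutoff confusion matrix from earlier (precision uses the predicted-positive column, recall the actual-positive row):

```python
import numpy as np

# Model 2 train confusion matrix at the default 0.5 cutoff.
confusion = np.array([[3941, 565],
                      [1047, 1729]])
TP, FP, FN = confusion[1, 1], confusion[0, 1], confusion[1, 0]

precision = TP / (TP + FP)   # 1729 / 2294, ~0.7537
recall = TP / (TP + FN)      # 1729 / 2776, ~0.6228 (identical to sensitivity)
print(round(precision, 4), round(recall, 4))
```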
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Conv_Prob)
plt.plot(thresholds, p[:-1], "b", label='precision')
plt.plot(thresholds, r[:-1], "r", label='recall')
plt.legend(loc='lower center')
plt.title('Precision Recall Curve')
plt.show()
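The crossing point of the two curves gives a precision-recall-balanced cutoff, which is what motivates moving to 0.40 below. A hedged sketch of locating such a crossing by grid search; the labels and scores here are synthetic stand-ins for `Converted` and `Conv_Prob`, so the resulting cutoff is illustrative only:

```python
import numpy as np

# Synthetic labels and scores (assumption: not the case-study data).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
y_prob = np.clip(0.6 * y_true + rng.normal(0.2, 0.2, 1000), 0, 1)

cutoffs = np.arange(0.05, 0.95, 0.05)
gaps = []
for c in cutoffs:
    pred = (y_prob > c).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    gaps.append(abs(precision - recall))

# Cutoff where the two curves are closest, i.e. where they cross.
best = cutoffs[int(np.argmin(gaps))]
print(f'precision ~= recall at cutoff {best:.2f}')
```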
y_train_pred_final['final_predicted'] = y_train_pred_final.Conv_Prob.map( lambda x: 1 if x > 0.40 else 0)
y_train_pred_final.head()
| Converted | Conv_Prob | ID | predicted | 0.000 | 0.100 | 0.200 | 0.300 | 0.400 | 0.500 | 0.600 | 0.700 | 0.800 | 0.900 | final_predicted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.095 | 7740 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.653 | 7014 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0.322 | 2431 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.694 | 3012 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0.165 | 3140 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Let's check the overall accuracy.
accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
0.7794561933534743
confusion = confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion
array([[3693, 813],
[ 793, 1983]])
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Calculate the sensitivity at the 0.40 cutoff
TP/(TP+FN)
# Calculate the specificity at the 0.40 cutoff
TN/(TN+FP)
# Precision and recall of the earlier 0.5-cutoff 'predicted' column, for comparison
precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
0.7143371757925072
0.8195739014647138
0.7537053182214473
0.6228386167146974
----------------------------------------------------------------------¶
Predictions on the test set¶
Custom Functions for Test¶
def logreg_test_pred_fn(fX_test, fy_test, fcol, fcutoff, fres):
    # Score the test set with the already-fitted model and apply the chosen cutoff
    fX_test_sm = sm.add_constant(fX_test[fcol])
    fy_test_pred = fres.predict(fX_test_sm)
    fy_test_pred = fy_test_pred.values.reshape(-1)
    fy_test_pred_final = pd.DataFrame({'Converted': fy_test.values, 'Conv_Prob': fy_test_pred})
    fy_test_pred_final['ID'] = fy_test.index
    fy_test_pred_final['predicted'] = fy_test_pred_final.Conv_Prob.map(lambda x: 1 if x > fcutoff else 0)
    return fres, fy_test_pred, fy_test_pred_final
def logreg_test_metrics_fn(fy_test_pred_final):
fconfusion = confusion_matrix(fy_test_pred_final.Converted, fy_test_pred_final.predicted )
faccuracy = accuracy_score(fy_test_pred_final.Converted, fy_test_pred_final.predicted)
return fconfusion, faccuracy
def logreg_test_VIF_score_fn(fX_test, fcol):
fvif = pd.DataFrame()
fvif['Features'] = fX_test[fcol].columns
fvif['VIF'] = [variance_inflation_factor(fX_test[fcol].values, i) for i in range(fX_test[fcol].shape[1])]
fvif['VIF'] = round(fvif['VIF'], 2)
fvif = fvif.sort_values(by = "VIF", ascending = False)
return fvif
Model Validation on Test Data¶
X_test[to_scale] = scaler.transform(X_test[to_scale])
X_test[col].head()
| lead_origin | lead_source | do_not_email | totalvisits | ttime_on_site | pg_view_pv | last_activity | specialization | curr_occupation | tags | avail_free_copy | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7421 | 1 | -1.063 | 1 | 2.722 | -0.502 | 0.102 | 0.639 | 4 | -0.122 | 2 | 1 |
| 4570 | 1 | -0.392 | 1 | -0.079 | -0.409 | 0.411 | 1.186 | 12 | -0.122 | 25 | 0 |
| 5302 | 0 | 0.614 | 0 | -1.130 | -0.889 | -1.274 | 0.365 | 18 | -0.122 | 25 | 0 |
| 6471 | 1 | 0.949 | 1 | 0.971 | -0.293 | 2.096 | 1.186 | 13 | -0.122 | 24 | 1 |
| 8958 | 1 | -1.063 | 0 | 0.271 | 0.708 | -0.150 | -0.729 | 7 | -0.122 | 0 | 1 |
cutoff = 0.39  # final cutoff, between the sensitivity-specificity (0.36) and precision-recall (0.40) balance points
res, y_test_pred, y_test_pred_final = logreg_test_pred_fn(X_test, y_test, col, cutoff, res)
confusion, accuracy = logreg_test_metrics_fn(y_test_pred_final)
vif = logreg_test_VIF_score_fn(X_test, col)
print('Model Summary:') # Model Summary:
res.summary()
print('\nVIF Score:') # VIF Score:
vif
# print('\nY_Predicted Values:') # Y_Predicted Values:
# y_test_pred
# print('\nY_Predicted Cutoff:') # Y_Predicted Cutoff:
# y_test_pred_final
Model Summary:
| Dep. Variable: | converted | No. Observations: | 7282 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 7270 |
| Model Family: | Binomial | Df Model: | 11 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -3559.9 |
| Date: | Sun, 20 Oct 2024 | Deviance: | 7119.9 |
| Time: | 23:04:25 | Pearson chi2: | 9.27e+03 |
| No. Iterations: | 5 | Pseudo R-squ. (CS): | 0.2964 |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -0.5380 | 0.120 | -4.492 | 0.000 | -0.773 | -0.303 |
| lead_origin | 0.6887 | 0.057 | 11.986 | 0.000 | 0.576 | 0.801 |
| lead_source | 0.5109 | 0.035 | 14.541 | 0.000 | 0.442 | 0.580 |
| do_not_email | -1.4786 | 0.145 | -10.196 | 0.000 | -1.763 | -1.194 |
| totalvisits | 0.1102 | 0.044 | 2.484 | 0.013 | 0.023 | 0.197 |
| ttime_on_site | 1.0427 | 0.034 | 30.828 | 0.000 | 0.976 | 1.109 |
| pg_view_pv | -0.4724 | 0.047 | -10.014 | 0.000 | -0.565 | -0.380 |
| last_activity | 0.5156 | 0.031 | 16.622 | 0.000 | 0.455 | 0.576 |
| specialization | -0.0423 | 0.006 | -6.991 | 0.000 | -0.054 | -0.030 |
| curr_occupation | 0.4307 | 0.036 | 11.954 | 0.000 | 0.360 | 0.501 |
| tags | 0.0096 | 0.004 | 2.530 | 0.011 | 0.002 | 0.017 |
| avail_free_copy | -0.4317 | 0.073 | -5.899 | 0.000 | -0.575 | -0.288 |
VIF Score:
| Features | VIF | |
|---|---|---|
| 9 | tags | 4.370 |
| 7 | specialization | 3.680 |
| 5 | pg_view_pv | 2.470 |
| 3 | totalvisits | 2.360 |
| 0 | lead_origin | 2.140 |
| 10 | avail_free_copy | 1.940 |
| 1 | lead_source | 1.320 |
| 4 | ttime_on_site | 1.260 |
| 2 | do_not_email | 1.110 |
| 6 | last_activity | 1.040 |
| 8 | curr_occupation | 1.040 |
print('\nConfusion Matrix:') # Confusion Matrix:
confusion
print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Confusion Matrix:
array([[948, 197],
[195, 481]])
Accuracy Score: 0.7847336628226249
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Calculate the sensitivity on the test set
TP/(TP+FN)
# Calculate the specificity on the test set
TN/(TN+FP)
# Precision and recall on the test predictions (y_test_pred_final)
precision_score(y_test_pred_final.Converted, y_test_pred_final.predicted)
recall_score(y_test_pred_final.Converted, y_test_pred_final.predicted)
0.7115384615384616
0.8279475982532751
Approach - 02 - Accuracy Score:¶
print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Accuracy Score: 0.7847336628226249
-----------------------------------------------------¶
# errorline #do not remove
-----------------------------------------------------¶
Approach - 03¶
Data Encoding¶
Dummy Encoding¶
# Dummy-encode all categorical columns except tags and specialization,
# dropping the first level of each as the baseline
new_ls_df = pd.get_dummies(lead_score_df, columns=lead_score_df.select_dtypes('category').columns.difference(['tags','specialization']), drop_first=True, dtype=float)
new_ls_df.head(1)
# For tags and specialization, keep all levels, then drop the explicit
# 'unknown' dummy so that 'unknown' becomes the baseline category
new_ls_df = pd.get_dummies(new_ls_df, columns=['tags','specialization'], dtype=float)
new_ls_df = new_ls_df.drop(['tags_unknown','specialization_unknown'], axis=1)
new_ls_df.head(1)
| converted | totalvisits | ttime_on_site | pg_view_pv | specialization | tags | avail_free_copy_1 | city_Other Cities | city_Other Cities of Maharashtra | city_Other Metro Cities | city_Thane & Outskirts | city_Tier II Cities | curr_occupation_Housewife | curr_occupation_Other | curr_occupation_Student | curr_occupation_Unemployed | curr_occupation_Working Professional | do_not_email_1 | last_activity_Converted to Lead | last_activity_Email Bounced | last_activity_Email Link Clicked | last_activity_Email Marked Spam | last_activity_Email Opened | last_activity_Email Received | last_activity_Form Submitted on Website | last_activity_Had a Phone Conversation | last_activity_Olark Chat Conversation | last_activity_Page Visited on Website | last_activity_Resubscribed to emails | last_activity_SMS Sent | last_activity_Unreachable | last_activity_Unsubscribed | last_activity_View in browser link Clicked | last_activity_Visited Booth in Tradeshow | lead_origin_Landing Page Submission | lead_origin_Lead Add Form | lead_origin_Lead Import | lead_origin_Quick Add Form | lead_source_Direct Traffic | lead_source_Facebook | lead_source_Google | lead_source_Live Chat | lead_source_NC_EDM | lead_source_Olark Chat | lead_source_Organic Search | lead_source_Pay per Click Ads | lead_source_Press_Release | lead_source_Reference | lead_source_Referral Sites | lead_source_Social Media | lead_source_WeLearn | lead_source_Welingak Website | lead_source_bing | lead_source_blog | lead_source_google | lead_source_testone | lead_source_welearnblog_Home | lead_source_youtubechannel | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.000 | 0.000 | 0.000 | unknown | Interested in other courses | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| converted | totalvisits | ttime_on_site | pg_view_pv | avail_free_copy_1 | city_Other Cities | city_Other Cities of Maharashtra | city_Other Metro Cities | city_Thane & Outskirts | city_Tier II Cities | curr_occupation_Housewife | curr_occupation_Other | curr_occupation_Student | curr_occupation_Unemployed | curr_occupation_Working Professional | do_not_email_1 | last_activity_Converted to Lead | last_activity_Email Bounced | last_activity_Email Link Clicked | last_activity_Email Marked Spam | last_activity_Email Opened | last_activity_Email Received | last_activity_Form Submitted on Website | last_activity_Had a Phone Conversation | last_activity_Olark Chat Conversation | last_activity_Page Visited on Website | last_activity_Resubscribed to emails | last_activity_SMS Sent | last_activity_Unreachable | last_activity_Unsubscribed | last_activity_View in browser link Clicked | last_activity_Visited Booth in Tradeshow | lead_origin_Landing Page Submission | lead_origin_Lead Add Form | lead_origin_Lead Import | lead_origin_Quick Add Form | lead_source_Direct Traffic | lead_source_Facebook | lead_source_Google | lead_source_Live Chat | lead_source_NC_EDM | lead_source_Olark Chat | lead_source_Organic Search | lead_source_Pay per Click Ads | lead_source_Press_Release | lead_source_Reference | lead_source_Referral Sites | lead_source_Social Media | lead_source_WeLearn | lead_source_Welingak Website | lead_source_bing | lead_source_blog | lead_source_google | lead_source_testone | lead_source_welearnblog_Home | lead_source_youtubechannel | tags_Already a student | tags_Busy | tags_Closed by Horizzon | tags_Diploma holder (Not Eligible) | tags_Graduation in progress | tags_In confusion whether part time or DLP | tags_Interested in full time MBA | tags_Interested in Next batch | tags_Interested in other courses | tags_Lateral student | tags_Lost to EINS | tags_Lost to Others | tags_Not doing further education | tags_Recognition issue (DEC approval) | tags_Ringing | tags_Shall take in the next coming month | tags_Still Thinking | tags_University not recognized | tags_Want to take admission but has financial problems | tags_Will revert after reading the email | tags_in touch with EINS | tags_invalid number | tags_number not provided | tags_opp hangup | tags_switched off | tags_wrong number given | specialization_Banking, Investment And Insurance | specialization_Business Administration | specialization_E-Business | specialization_E-COMMERCE | specialization_Finance Management | specialization_Healthcare Management | specialization_Hospitality Management | specialization_Human Resource Management | specialization_IT Projects Management | specialization_International Business | specialization_Marketing Management | specialization_Media and Advertising | specialization_Operations Management | specialization_Retail Management | specialization_Rural and Agribusiness | specialization_Services Excellence | specialization_Supply Chain Management | specialization_Travel and Tourism | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
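The two-step encoding above can be illustrated on toy data; column names below are invented, not from the dataset. With `drop_first=True` the first level becomes the implicit baseline, while for the tag-style columns all levels are kept and the explicit `unknown` dummy is dropped afterwards, making `unknown` the baseline:

```python
import pandas as pd

# Toy frame standing in for lead_score_df's categorical columns.
df = pd.DataFrame({'tags': ['Busy', 'unknown', 'Ringing'],
                   'city': ['Mumbai', 'Thane', 'Mumbai']})

# Implicit baseline: drop_first removes the first city level ('Mumbai').
step1 = pd.get_dummies(df, columns=['city'], drop_first=True, dtype=float)
# Explicit baseline: keep every tag dummy, then drop the 'unknown' one.
step2 = pd.get_dummies(step1, columns=['tags'], dtype=float).drop('tags_unknown', axis=1)
print(sorted(step2.columns))  # ['city_Thane', 'tags_Busy', 'tags_Ringing']
```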
----------------------------------------------------------------------¶
Train and Test Split¶
X = new_ls_df.drop(['converted'], axis=1)
y = new_ls_df['converted']
# Now we split the dataset into train (80%) and test (20%) sets;
# random_state=0 makes the split reproducible
np.random.seed(0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
----------------------------------------------------------------------¶
Feature Scaling¶
new_ls_df.dtypes
converted int64 totalvisits float64 ttime_on_site float64 pg_view_pv float64 avail_free_copy_1 float64 city_Other Cities float64 city_Other Cities of Maharashtra float64 city_Other Metro Cities float64 city_Thane & Outskirts float64 city_Tier II Cities float64 curr_occupation_Housewife float64 curr_occupation_Other float64 curr_occupation_Student float64 curr_occupation_Unemployed float64 curr_occupation_Working Professional float64 do_not_email_1 float64 last_activity_Converted to Lead float64 last_activity_Email Bounced float64 last_activity_Email Link Clicked float64 last_activity_Email Marked Spam float64 last_activity_Email Opened float64 last_activity_Email Received float64 last_activity_Form Submitted on Website float64 last_activity_Had a Phone Conversation float64 last_activity_Olark Chat Conversation float64 last_activity_Page Visited on Website float64 last_activity_Resubscribed to emails float64 last_activity_SMS Sent float64 last_activity_Unreachable float64 last_activity_Unsubscribed float64 last_activity_View in browser link Clicked float64 last_activity_Visited Booth in Tradeshow float64 lead_origin_Landing Page Submission float64 lead_origin_Lead Add Form float64 lead_origin_Lead Import float64 lead_origin_Quick Add Form float64 lead_source_Direct Traffic float64 lead_source_Facebook float64 lead_source_Google float64 lead_source_Live Chat float64 lead_source_NC_EDM float64 lead_source_Olark Chat float64 lead_source_Organic Search float64 lead_source_Pay per Click Ads float64 lead_source_Press_Release float64 lead_source_Reference float64 lead_source_Referral Sites float64 lead_source_Social Media float64 lead_source_WeLearn float64 lead_source_Welingak Website float64 lead_source_bing float64 lead_source_blog float64 lead_source_google float64 lead_source_testone float64 lead_source_welearnblog_Home float64 lead_source_youtubechannel float64 tags_Already a student float64 tags_Busy float64 tags_Closed by Horizzon float64 tags_Diploma holder (Not Eligible) float64 tags_Graduation in progress float64 tags_In confusion whether part time or DLP float64 tags_Interested in full time MBA float64 tags_Interested in Next batch float64 tags_Interested in other courses float64 tags_Lateral student float64 tags_Lost to EINS float64 tags_Lost to Others float64 tags_Not doing further education float64 tags_Recognition issue (DEC approval) float64 tags_Ringing float64 tags_Shall take in the next coming month float64 tags_Still Thinking float64 tags_University not recognized float64 tags_Want to take admission but has financial problems float64 tags_Will revert after reading the email float64 tags_in touch with EINS float64 tags_invalid number float64 tags_number not provided float64 tags_opp hangup float64 tags_switched off float64 tags_wrong number given float64 specialization_Banking, Investment And Insurance float64 specialization_Business Administration float64 specialization_E-Business float64 specialization_E-COMMERCE float64 specialization_Finance Management float64 specialization_Healthcare Management float64 specialization_Hospitality Management float64 specialization_Human Resource Management float64 specialization_IT Projects Management float64 specialization_International Business float64 specialization_Marketing Management float64 specialization_Media and Advertising float64 specialization_Operations Management float64 specialization_Retail Management float64 specialization_Rural and Agribusiness float64 specialization_Services Excellence float64 specialization_Supply Chain Management float64 specialization_Travel and Tourism float64 dtype: object
# After the split, fit a MinMaxScaler on the training set only and transform it;
# the test set is transformed later with the same fitted scaler
# to_scale = ['lead_number', 'totalvisits', 'ttime_on_site', 'pg_view_pv']
to_scale = X.columns
to_scale
scaler = MinMaxScaler()
X_train[to_scale] = scaler.fit_transform(X_train[to_scale])
X_train.head()
Index(['totalvisits', 'ttime_on_site', 'pg_view_pv', 'avail_free_copy_1',
'city_Other Cities', 'city_Other Cities of Maharashtra',
'city_Other Metro Cities', 'city_Thane & Outskirts',
'city_Tier II Cities', 'curr_occupation_Housewife',
'curr_occupation_Other', 'curr_occupation_Student',
'curr_occupation_Unemployed', 'curr_occupation_Working Professional',
'do_not_email_1', 'last_activity_Converted to Lead',
'last_activity_Email Bounced', 'last_activity_Email Link Clicked',
'last_activity_Email Marked Spam', 'last_activity_Email Opened',
'last_activity_Email Received',
'last_activity_Form Submitted on Website',
'last_activity_Had a Phone Conversation',
'last_activity_Olark Chat Conversation',
'last_activity_Page Visited on Website',
'last_activity_Resubscribed to emails', 'last_activity_SMS Sent',
'last_activity_Unreachable', 'last_activity_Unsubscribed',
'last_activity_View in browser link Clicked',
'last_activity_Visited Booth in Tradeshow',
'lead_origin_Landing Page Submission', 'lead_origin_Lead Add Form',
'lead_origin_Lead Import', 'lead_origin_Quick Add Form',
'lead_source_Direct Traffic', 'lead_source_Facebook',
'lead_source_Google', 'lead_source_Live Chat', 'lead_source_NC_EDM',
'lead_source_Olark Chat', 'lead_source_Organic Search',
'lead_source_Pay per Click Ads', 'lead_source_Press_Release',
'lead_source_Reference', 'lead_source_Referral Sites',
'lead_source_Social Media', 'lead_source_WeLearn',
'lead_source_Welingak Website', 'lead_source_bing', 'lead_source_blog',
'lead_source_google', 'lead_source_testone',
'lead_source_welearnblog_Home', 'lead_source_youtubechannel',
'tags_Already a student', 'tags_Busy', 'tags_Closed by Horizzon',
'tags_Diploma holder (Not Eligible)', 'tags_Graduation in progress',
'tags_In confusion whether part time or DLP',
'tags_Interested in full time MBA', 'tags_Interested in Next batch',
'tags_Interested in other courses', 'tags_Lateral student',
'tags_Lost to EINS', 'tags_Lost to Others',
'tags_Not doing further education',
'tags_Recognition issue (DEC approval)', 'tags_Ringing',
'tags_Shall take in the next coming month', 'tags_Still Thinking',
'tags_University not recognized',
'tags_Want to take admission but has financial problems',
'tags_Will revert after reading the email', 'tags_in touch with EINS',
'tags_invalid number', 'tags_number not provided', 'tags_opp hangup',
'tags_switched off', 'tags_wrong number given',
'specialization_Banking, Investment And Insurance',
'specialization_Business Administration', 'specialization_E-Business',
'specialization_E-COMMERCE', 'specialization_Finance Management',
'specialization_Healthcare Management',
'specialization_Hospitality Management',
'specialization_Human Resource Management',
'specialization_IT Projects Management',
'specialization_International Business',
'specialization_Marketing Management',
'specialization_Media and Advertising',
'specialization_Operations Management',
'specialization_Retail Management',
'specialization_Rural and Agribusiness',
'specialization_Services Excellence',
'specialization_Supply Chain Management',
'specialization_Travel and Tourism'],
dtype='object')
| totalvisits | ttime_on_site | pg_view_pv | avail_free_copy_1 | city_Other Cities | city_Other Cities of Maharashtra | city_Other Metro Cities | city_Thane & Outskirts | city_Tier II Cities | curr_occupation_Housewife | curr_occupation_Other | curr_occupation_Student | curr_occupation_Unemployed | curr_occupation_Working Professional | do_not_email_1 | last_activity_Converted to Lead | last_activity_Email Bounced | last_activity_Email Link Clicked | last_activity_Email Marked Spam | last_activity_Email Opened | last_activity_Email Received | last_activity_Form Submitted on Website | last_activity_Had a Phone Conversation | last_activity_Olark Chat Conversation | last_activity_Page Visited on Website | last_activity_Resubscribed to emails | last_activity_SMS Sent | last_activity_Unreachable | last_activity_Unsubscribed | last_activity_View in browser link Clicked | last_activity_Visited Booth in Tradeshow | lead_origin_Landing Page Submission | lead_origin_Lead Add Form | lead_origin_Lead Import | lead_origin_Quick Add Form | lead_source_Direct Traffic | lead_source_Facebook | lead_source_Google | lead_source_Live Chat | lead_source_NC_EDM | lead_source_Olark Chat | lead_source_Organic Search | lead_source_Pay per Click Ads | lead_source_Press_Release | lead_source_Reference | lead_source_Referral Sites | lead_source_Social Media | lead_source_WeLearn | lead_source_Welingak Website | lead_source_bing | lead_source_blog | lead_source_google | lead_source_testone | lead_source_welearnblog_Home | lead_source_youtubechannel | tags_Already a student | tags_Busy | tags_Closed by Horizzon | tags_Diploma holder (Not Eligible) | tags_Graduation in progress | tags_In confusion whether part time or DLP | tags_Interested in full time MBA | tags_Interested in Next batch | tags_Interested in other courses | tags_Lateral student | tags_Lost to EINS | tags_Lost to Others | tags_Not doing further education | tags_Recognition issue (DEC approval) | tags_Ringing | tags_Shall take in the next coming month | tags_Still Thinking | tags_University not recognized | tags_Want to take admission but has financial problems | tags_Will revert after reading the email | tags_in touch with EINS | tags_invalid number | tags_number not provided | tags_opp hangup | tags_switched off | tags_wrong number given | specialization_Banking, Investment And Insurance | specialization_Business Administration | specialization_E-Business | specialization_E-COMMERCE | specialization_Finance Management | specialization_Healthcare Management | specialization_Hospitality Management | specialization_Human Resource Management | specialization_IT Projects Management | specialization_International Business | specialization_Marketing Management | specialization_Media and Advertising | specialization_Operations Management | specialization_Retail Management | specialization_Rural and Agribusiness | specialization_Services Excellence | specialization_Supply Chain Management | specialization_Travel and Tourism | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7740 | 0.091 | 0.124 | 0.167 | 1.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 7014 | 0.545 | 0.426 | 0.333 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 2431 | 0.182 | 0.133 | 0.333 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 3012 | 0.727 | 0.585 | 0.667 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 3140 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
----------------------------------------------------------------------¶
Model Building¶
Custom Functions for Model Training¶
# We define custom helper functions for model building, since each iteration repeats the same steps
# logreg_train_pred_fn trains the model, predicts on the training data, and returns the fitted model, the predicted probabilities, and the predicted classes based on the cutoff
# logreg_metrics_fn returns the confusion matrix and accuracy score
# logreg_VIF_score_fn returns the VIF score for each feature
def logreg_train_pred_fn(fX_train, fy_train, fcol, fcutoff):
    fX_train_sm = sm.add_constant(fX_train[fcol])
    flogm = sm.GLM(fy_train, fX_train_sm, family=sm.families.Binomial())
    fres = flogm.fit()
    fy_train_pred = fres.predict(fX_train_sm)
    fy_train_pred = fy_train_pred.values.reshape(-1)
    fy_train_pred_final = pd.DataFrame({'Converted': fy_train.values, 'Conv_Prob': fy_train_pred})
    fy_train_pred_final['ID'] = fy_train.index
    fy_train_pred_final['predicted'] = fy_train_pred_final.Conv_Prob.map(lambda x: 1 if x > fcutoff else 0)
    return fres, fy_train_pred, fy_train_pred_final
def logreg_metrics_fn(fy_train_pred_final):
    fconfusion = confusion_matrix(fy_train_pred_final.Converted, fy_train_pred_final.predicted)
    faccuracy = accuracy_score(fy_train_pred_final.Converted, fy_train_pred_final.predicted)
    return fconfusion, faccuracy
def logreg_VIF_score_fn(fX_train, fcol):
    fvif = pd.DataFrame()
    fvif['Features'] = fX_train[fcol].columns
    fvif['VIF'] = [variance_inflation_factor(fX_train[fcol].values, i) for i in range(fX_train[fcol].shape[1])]
    fvif['VIF'] = round(fvif['VIF'], 2)
    fvif = fvif.sort_values(by="VIF", ascending=False)
    return fvif
Base Model 1¶
# Logistic regression model
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
res = logm1.fit()
# res.summary()
RFE - Recursive Feature Elimination¶
# Since the dataset has many features, we use RFE (Recursive Feature Elimination) to drop insignificant ones
logreg = LogisticRegression()
rfe = RFE(estimator=logreg, n_features_to_select=15) # running RFE with 15 variables as output
rfe = rfe.fit(X_train, y_train)
rfe_feature_Ranking = list(zip(X_train.columns, rfe.support_, rfe.ranking_))
rfe_sorted = sorted(rfe_feature_Ranking, key=lambda x : x[2])
rfe_sorted[:15]
[('ttime_on_site', True, 1),
('last_activity_SMS Sent', True, 1),
('lead_source_Welingak Website', True, 1),
('tags_Already a student', True, 1),
('tags_Closed by Horizzon', True, 1),
('tags_Interested in full time MBA', True, 1),
('tags_Interested in other courses', True, 1),
('tags_Lost to EINS', True, 1),
('tags_Not doing further education', True, 1),
('tags_Ringing', True, 1),
('tags_Will revert after reading the email', True, 1),
('tags_invalid number', True, 1),
('tags_number not provided', True, 1),
('tags_switched off', True, 1),
('tags_wrong number given', True, 1)]
col = X_train.columns[rfe.support_]
# X_train.columns[~rfe.support_]
Model 1¶
# We now iterate on the model, dropping one insignificant or highly collinear feature at a time, until we reach a stable result
cutoff = 0.5
res, y_train_pred,y_train_pred_final = logreg_train_pred_fn(X_train, y_train, col, cutoff)
confusion, accuracy = logreg_metrics_fn(y_train_pred_final)
vif = logreg_VIF_score_fn(X_train, col)
print('Model Summary:') # Model Summary:
res.summary()
print('\nVIF Score:') # VIF Score:
vif
# print('\nY_Predicted Values:') # Y_Predicted Values:
# y_train_pred
# print('\nY_Predicted Cutoff:') # Y_Predicted Cutoff:
# y_train_pred_final
# print('\nConfusion Matrix:') # Confusion Matrix:
# confusion
# print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Model Summary:
| Dep. Variable: | converted | No. Observations: | 7282 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 7266 |
| Model Family: | Binomial | Df Model: | 15 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -1654.3 |
| Date: | Sun, 20 Oct 2024 | Deviance: | 3308.7 |
| Time: | 23:04:34 | Pearson chi2: | 7.92e+03 |
| No. Iterations: | 23 | Pseudo R-squ. (CS): | 0.5831 |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -2.5738 | 0.080 | -32.306 | 0.000 | -2.730 | -2.418 |
| ttime_on_site | 3.3673 | 0.186 | 18.083 | 0.000 | 3.002 | 3.732 |
| last_activity_SMS Sent | 1.9695 | 0.097 | 20.333 | 0.000 | 1.780 | 2.159 |
| lead_source_Welingak Website | 5.9595 | 1.015 | 5.869 | 0.000 | 3.969 | 7.950 |
| tags_Already a student | -3.4976 | 0.714 | -4.898 | 0.000 | -4.897 | -2.098 |
| tags_Closed by Horizzon | 6.6849 | 0.715 | 9.351 | 0.000 | 5.284 | 8.086 |
| tags_Interested in full time MBA | -2.3781 | 0.752 | -3.163 | 0.002 | -3.852 | -0.904 |
| tags_Interested in other courses | -2.0603 | 0.311 | -6.625 | 0.000 | -2.670 | -1.451 |
| tags_Lost to EINS | 5.0236 | 0.519 | 9.687 | 0.000 | 4.007 | 6.040 |
| tags_Not doing further education | -3.1741 | 1.011 | -3.139 | 0.002 | -5.156 | -1.192 |
| tags_Ringing | -2.9089 | 0.204 | -14.230 | 0.000 | -3.310 | -2.508 |
| tags_Will revert after reading the email | 4.5600 | 0.158 | 28.854 | 0.000 | 4.250 | 4.870 |
| tags_invalid number | -3.4463 | 1.073 | -3.211 | 0.001 | -5.550 | -1.343 |
| tags_number not provided | -24.2221 | 2.73e+04 | -0.001 | 0.999 | -5.36e+04 | 5.36e+04 |
| tags_switched off | -3.2715 | 0.519 | -6.306 | 0.000 | -4.288 | -2.255 |
| tags_wrong number given | -23.8951 | 1.93e+04 | -0.001 | 0.999 | -3.78e+04 | 3.77e+04 |
VIF Score:
| Features | VIF | |
|---|---|---|
| 0 | ttime_on_site | 1.650 |
| 10 | tags_Will revert after reading the email | 1.530 |
| 1 | last_activity_SMS Sent | 1.480 |
| 9 | tags_Ringing | 1.120 |
| 4 | tags_Closed by Horizzon | 1.050 |
| 7 | tags_Lost to EINS | 1.040 |
| 2 | lead_source_Welingak Website | 1.030 |
| 13 | tags_switched off | 1.030 |
| 6 | tags_Interested in other courses | 1.020 |
| 3 | tags_Already a student | 1.010 |
| 5 | tags_Interested in full time MBA | 1.010 |
| 8 | tags_Not doing further education | 1.010 |
| 11 | tags_invalid number | 1.010 |
| 14 | tags_wrong number given | 1.010 |
| 12 | tags_number not provided | 1.000 |
Model 2¶
# Drop 'tags_number not provided' (p-value 0.999, insignificant); pd.Index.drop takes no axis argument
col = col.drop('tags_number not provided')
col
cutoff = 0.5
res, y_train_pred,y_train_pred_final = logreg_train_pred_fn(X_train, y_train, col, cutoff)
confusion, accuracy = logreg_metrics_fn(y_train_pred_final)
vif = logreg_VIF_score_fn(X_train, col)
print('Model Summary:') # Model Summary:
res.summary()
print('\nVIF Score:') # VIF Score:
vif
# print('\nY_Predicted Values:') # Y_Predicted Values:
# y_train_pred
# print('\nY_Predicted Cutoff:') # Y_Predicted Cutoff:
# y_train_pred_final
# print('\nConfusion Matrix:') # Confusion Matrix:
# confusion
# print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Index(['ttime_on_site', 'last_activity_SMS Sent',
'lead_source_Welingak Website', 'tags_Already a student',
'tags_Closed by Horizzon', 'tags_Interested in full time MBA',
'tags_Interested in other courses', 'tags_Lost to EINS',
'tags_Not doing further education', 'tags_Ringing',
'tags_Will revert after reading the email', 'tags_invalid number',
'tags_switched off', 'tags_wrong number given'],
dtype='object')
Model Summary:
| Dep. Variable: | converted | No. Observations: | 7282 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 7267 |
| Model Family: | Binomial | Df Model: | 14 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -1662.5 |
| Date: | Sun, 20 Oct 2024 | Deviance: | 3325.0 |
| Time: | 23:04:34 | Pearson chi2: | 7.89e+03 |
| No. Iterations: | 22 | Pseudo R-squ. (CS): | 0.5822 |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -2.5765 | 0.080 | -32.391 | 0.000 | -2.732 | -2.421 |
| ttime_on_site | 3.3419 | 0.186 | 18.012 | 0.000 | 2.978 | 3.706 |
| last_activity_SMS Sent | 1.9566 | 0.096 | 20.279 | 0.000 | 1.768 | 2.146 |
| lead_source_Welingak Website | 5.9655 | 1.015 | 5.875 | 0.000 | 3.975 | 7.956 |
| tags_Already a student | -3.4854 | 0.714 | -4.882 | 0.000 | -4.885 | -2.086 |
| tags_Closed by Horizzon | 6.6906 | 0.715 | 9.359 | 0.000 | 5.289 | 8.092 |
| tags_Interested in full time MBA | -2.3630 | 0.751 | -3.147 | 0.002 | -3.835 | -0.891 |
| tags_Interested in other courses | -2.0456 | 0.311 | -6.584 | 0.000 | -2.655 | -1.437 |
| tags_Lost to EINS | 5.0308 | 0.518 | 9.703 | 0.000 | 4.015 | 6.047 |
| tags_Not doing further education | -3.1612 | 1.011 | -3.127 | 0.002 | -5.143 | -1.180 |
| tags_Ringing | -2.8871 | 0.204 | -14.147 | 0.000 | -3.287 | -2.487 |
| tags_Will revert after reading the email | 4.5690 | 0.158 | 28.930 | 0.000 | 4.259 | 4.879 |
| tags_invalid number | -3.4238 | 1.072 | -3.195 | 0.001 | -5.524 | -1.324 |
| tags_switched off | -3.2511 | 0.519 | -6.270 | 0.000 | -4.267 | -2.235 |
| tags_wrong number given | -22.8750 | 1.17e+04 | -0.002 | 0.998 | -2.29e+04 | 2.29e+04 |
VIF Score:
| Features | VIF | |
|---|---|---|
| 0 | ttime_on_site | 1.650 |
| 10 | tags_Will revert after reading the email | 1.530 |
| 1 | last_activity_SMS Sent | 1.470 |
| 9 | tags_Ringing | 1.120 |
| 4 | tags_Closed by Horizzon | 1.050 |
| 2 | lead_source_Welingak Website | 1.030 |
| 7 | tags_Lost to EINS | 1.030 |
| 12 | tags_switched off | 1.030 |
| 6 | tags_Interested in other courses | 1.020 |
| 3 | tags_Already a student | 1.010 |
| 5 | tags_Interested in full time MBA | 1.010 |
| 8 | tags_Not doing further education | 1.010 |
| 11 | tags_invalid number | 1.010 |
| 13 | tags_wrong number given | 1.010 |
Model 3¶
# Drop 'tags_wrong number given' (p-value 0.998, insignificant)
col = col.drop('tags_wrong number given')
col
cutoff = 0.5
res, y_train_pred,y_train_pred_final = logreg_train_pred_fn(X_train, y_train, col, cutoff)
confusion, accuracy = logreg_metrics_fn(y_train_pred_final)
vif = logreg_VIF_score_fn(X_train, col)
print('Model Summary:') # Model Summary:
res.summary()
print('\nVIF Score:') # VIF Score:
vif
# print('\nY_Predicted Values:') # Y_Predicted Values:
# y_train_pred
# print('\nY_Predicted Cutoff:') # Y_Predicted Cutoff:
# y_train_pred_final
print('\nConfusion Matrix:') # Confusion Matrix:
confusion
print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Index(['ttime_on_site', 'last_activity_SMS Sent',
'lead_source_Welingak Website', 'tags_Already a student',
'tags_Closed by Horizzon', 'tags_Interested in full time MBA',
'tags_Interested in other courses', 'tags_Lost to EINS',
'tags_Not doing further education', 'tags_Ringing',
'tags_Will revert after reading the email', 'tags_invalid number',
'tags_switched off'],
dtype='object')
Model Summary:
| Dep. Variable: | converted | No. Observations: | 7282 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 7268 |
| Model Family: | Binomial | Df Model: | 13 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -1677.1 |
| Date: | Sun, 20 Oct 2024 | Deviance: | 3354.3 |
| Time: | 23:04:34 | Pearson chi2: | 7.86e+03 |
| No. Iterations: | 8 | Pseudo R-squ. (CS): | 0.5805 |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -2.5895 | 0.079 | -32.593 | 0.000 | -2.745 | -2.434 |
| ttime_on_site | 3.3369 | 0.185 | 18.080 | 0.000 | 2.975 | 3.699 |
| last_activity_SMS Sent | 1.9243 | 0.096 | 20.103 | 0.000 | 1.737 | 2.112 |
| lead_source_Welingak Website | 5.9866 | 1.015 | 5.897 | 0.000 | 3.997 | 7.976 |
| tags_Already a student | -3.4670 | 0.714 | -4.856 | 0.000 | -4.866 | -2.068 |
| tags_Closed by Horizzon | 6.7048 | 0.715 | 9.379 | 0.000 | 5.304 | 8.106 |
| tags_Interested in full time MBA | -2.3393 | 0.750 | -3.120 | 0.002 | -3.809 | -0.870 |
| tags_Interested in other courses | -2.0230 | 0.310 | -6.519 | 0.000 | -2.631 | -1.415 |
| tags_Lost to EINS | 5.0457 | 0.518 | 9.733 | 0.000 | 4.030 | 6.062 |
| tags_Not doing further education | -3.1410 | 1.011 | -3.107 | 0.002 | -5.122 | -1.160 |
| tags_Ringing | -2.8475 | 0.204 | -13.985 | 0.000 | -3.247 | -2.448 |
| tags_Will revert after reading the email | 4.5883 | 0.158 | 29.075 | 0.000 | 4.279 | 4.898 |
| tags_invalid number | -3.3843 | 1.070 | -3.162 | 0.002 | -5.482 | -1.286 |
| tags_switched off | -3.2113 | 0.518 | -6.196 | 0.000 | -4.227 | -2.196 |
VIF Score:
| Features | VIF | |
|---|---|---|
| 0 | ttime_on_site | 1.650 |
| 10 | tags_Will revert after reading the email | 1.530 |
| 1 | last_activity_SMS Sent | 1.470 |
| 9 | tags_Ringing | 1.120 |
| 4 | tags_Closed by Horizzon | 1.050 |
| 2 | lead_source_Welingak Website | 1.030 |
| 7 | tags_Lost to EINS | 1.030 |
| 12 | tags_switched off | 1.030 |
| 6 | tags_Interested in other courses | 1.020 |
| 3 | tags_Already a student | 1.010 |
| 5 | tags_Interested in full time MBA | 1.010 |
| 8 | tags_Not doing further education | 1.010 |
| 11 | tags_invalid number | 1.010 |
Confusion Matrix:
array([[4334, 172],
[ 487, 2289]])
Accuracy Score: 0.9095028838231255
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Calculate the sensitivity
TP/(TP+FN)
# Calculate the specificity
TN/(TN+FP)
0.8245677233429395
0.9618286728806036
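The TP/TN/FP/FN arithmetic above recurs at every evaluation step; a small helper (hypothetical, not part of the notebook) keeps it in one place. Using Model 3's confusion matrix as input:

```python
import numpy as np

def sens_spec(cm):
    """Return (sensitivity, specificity) from a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]] (sklearn's convention)."""
    tn, fp, fn, tp = cm.ravel()
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return sensitivity, specificity

cm = np.array([[4334, 172], [487, 2289]])  # Model 3 confusion matrix from above
sens, spec = sens_spec(cm)
print(round(sens, 4), round(spec, 4))      # 0.8246 0.9618
```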
----------------------------------------------------------------------¶
Finding Optimal Cutoff Point¶
# Let's create columns with different probability cutoffs
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i] = y_train_pred_final.Conv_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()
| Converted | Conv_Prob | ID | predicted | 0.000 | 0.100 | 0.200 | 0.300 | 0.400 | 0.500 | 0.600 | 0.700 | 0.800 | 0.900 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.004 | 7740 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.968 | 7014 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0.005 | 2431 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.981 | 3012 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 0 | 0.070 | 3140 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Now let's calculate accuracy, sensitivity, and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i])
    total1 = sum(sum(cm1))
    accuracy = (cm1[0,0] + cm1[1,1]) / total1
    speci = cm1[0,0] / (cm1[0,0] + cm1[0,1])
    sensi = cm1[1,1] / (cm1[1,0] + cm1[1,1])
    cutoff_df.loc[i] = [i, accuracy, sensi, speci]
print(cutoff_df)
| | prob | accuracy | sensi | speci |
|---|---|---|---|---|
| 0.000 | 0.000 | 0.381 | 1.000 | 0.000 |
| 0.100 | 0.100 | 0.827 | 0.953 | 0.750 |
| 0.200 | 0.200 | 0.891 | 0.936 | 0.863 |
| 0.300 | 0.300 | 0.896 | 0.914 | 0.884 |
| 0.400 | 0.400 | 0.900 | 0.846 | 0.933 |
| 0.500 | 0.500 | 0.910 | 0.825 | 0.962 |
| 0.600 | 0.600 | 0.910 | 0.811 | 0.970 |
| 0.700 | 0.700 | 0.905 | 0.792 | 0.974 |
| 0.800 | 0.800 | 0.896 | 0.754 | 0.983 |
| 0.900 | 0.900 | 0.868 | 0.670 | 0.990 |
# Let's plot accuracy, sensitivity, and specificity against the probability cutoff.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()
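The balanced cutoff can also be read off programmatically as the point where sensitivity and specificity are closest. A sketch using the values from the cutoff table above (copied in, so the snippet is self-contained):

```python
import pandas as pd

# Values from the cutoff table printed above
cutoff_df = pd.DataFrame({
    'prob':     [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    'accuracy': [0.381, 0.827, 0.891, 0.896, 0.900, 0.910, 0.910, 0.905, 0.896, 0.868],
    'sensi':    [1.000, 0.953, 0.936, 0.914, 0.846, 0.825, 0.811, 0.792, 0.754, 0.670],
    'speci':    [0.000, 0.750, 0.863, 0.884, 0.933, 0.962, 0.970, 0.974, 0.983, 0.990],
})

# Cutoff where sensitivity and specificity are closest (the crossover point)
balanced = cutoff_df.loc[(cutoff_df.sensi - cutoff_df.speci).abs().idxmin(), 'prob']
print(balanced)  # 0.3 on this coarse grid; inspecting the plot on a finer grid refines it
```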
# From the plot, sensitivity and specificity intersect at roughly 0.34; use that as the cutoff
y_train_pred_final['final_predicted'] = y_train_pred_final.Conv_Prob.map(lambda x: 1 if x > 0.34 else 0)
y_train_pred_final.head()
| Converted | Conv_Prob | ID | predicted | 0.000 | 0.100 | 0.200 | 0.300 | 0.400 | 0.500 | 0.600 | 0.700 | 0.800 | 0.900 | final_predicted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.004 | 7740 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.968 | 7014 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0.005 | 2431 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.981 | 3012 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 0 | 0.070 | 3140 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Let's check the overall accuracy.
accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
0.8893161219445207
confusion = confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion
array([[4061, 445],
[ 361, 2415]])
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Calculate the sensitivity
TP/(TP+FN)
# Calculate the specificity
TN/(TN+FP)
0.8699567723342939
0.901242787394585
ROC Curve and Precision - Recall Curve¶
def draw_roc(actual, probs):
    fpr, tpr, thresholds = roc_curve(actual, probs, drop_intermediate=False)
    auc_score = roc_auc_score(actual, probs)
    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return None
fpr, tpr, thresholds = roc_curve( y_train_pred_final.Converted, y_train_pred_final.Conv_Prob, drop_intermediate = False )
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Conv_Prob)
precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
0.9301097114993905
0.8245677233429395
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Conv_Prob)
plt.plot(thresholds, p[:-1], "b", label='Precision')
plt.plot(thresholds, r[:-1], "r", label='Recall')
plt.title('Precision Recall Curve')
plt.legend(loc='best')
plt.show()
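The cutoff chosen next corresponds to the point where the precision and recall curves cross. A self-contained sketch of locating that crossover, on synthetic scores (so the resulting threshold differs from the notebook's 0.37):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Synthetic probabilities: positives tend to score higher than negatives
y_true = np.r_[np.zeros(500, dtype=int), np.ones(500, dtype=int)]
scores = np.r_[rng.beta(2, 5, 500), rng.beta(5, 2, 500)]

p, r, thresholds = precision_recall_curve(y_true, scores)
# Threshold where precision and recall are closest, i.e. where the curves cross
crossover = thresholds[np.argmin(np.abs(p[:-1] - r[:-1]))]
print(round(float(crossover), 2))
```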
# The precision and recall curves cross at roughly 0.37; use that as the new cutoff
y_train_pred_final['final_predicted'] = y_train_pred_final.Conv_Prob.map(lambda x: 1 if x > 0.37 else 0)
y_train_pred_final.head()
| Converted | Conv_Prob | ID | predicted | 0.000 | 0.100 | 0.200 | 0.300 | 0.400 | 0.500 | 0.600 | 0.700 | 0.800 | 0.900 | final_predicted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.004 | 7740 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0.968 | 7014 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0.005 | 2431 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0.981 | 3012 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 0 | 0.070 | 3140 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Let's check the overall accuracy.
accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
0.8950837681955507
confusion = confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion
array([[4141, 365],
[ 399, 2377]])
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Calculate the sensitivity
TP/(TP+FN)
# Calculate the specificity
TN/(TN+FP)
# Note: precision and recall below use the initial 0.5-cutoff 'predicted' column, not 'final_predicted'
precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
0.8562680115273775
0.9189968930315136
0.9301097114993905
0.8245677233429395
----------------------------------------------------------------------¶
Predictions on the test set¶
Custom Functions for Test¶
def logreg_test_pred_fn(fX_test, fy_test, fcol, fcutoff, fres):
    fX_test_sm = sm.add_constant(fX_test[fcol])
    fy_test_pred = fres.predict(fX_test_sm)
    fy_test_pred = fy_test_pred.values.reshape(-1)
    fy_test_pred_final = pd.DataFrame({'Converted': fy_test.values, 'Conv_Prob': fy_test_pred})
    fy_test_pred_final['ID'] = fy_test.index
    fy_test_pred_final['predicted'] = fy_test_pred_final.Conv_Prob.map(lambda x: 1 if x > fcutoff else 0)
    return fres, fy_test_pred, fy_test_pred_final
def logreg_test_metrics_fn(fy_test_pred_final):
    fconfusion = confusion_matrix(fy_test_pred_final.Converted, fy_test_pred_final.predicted)
    faccuracy = accuracy_score(fy_test_pred_final.Converted, fy_test_pred_final.predicted)
    return fconfusion, faccuracy
def logreg_test_VIF_score_fn(fX_test, fcol):
    fvif = pd.DataFrame()
    fvif['Features'] = fX_test[fcol].columns
    fvif['VIF'] = [variance_inflation_factor(fX_test[fcol].values, i) for i in range(fX_test[fcol].shape[1])]
    fvif['VIF'] = round(fvif['VIF'], 2)
    fvif = fvif.sort_values(by="VIF", ascending=False)
    return fvif
Model Validation on Test Data¶
# Apply the scaler fitted on the training data (transform only; never refit on test data)
X_test[to_scale] = scaler.transform(X_test[to_scale])
X_test[col].head()
| ttime_on_site | last_activity_SMS Sent | lead_source_Welingak Website | tags_Already a student | tags_Closed by Horizzon | tags_Interested in full time MBA | tags_Interested in other courses | tags_Lost to EINS | tags_Not doing further education | tags_Ringing | tags_Will revert after reading the email | tags_invalid number | tags_switched off | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7421 | 0.093 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 4570 | 0.116 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 5302 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 6471 | 0.143 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| 8958 | 0.385 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
cutoff = 0.37
res, y_test_pred, y_test_pred_final = logreg_test_pred_fn(X_test, y_test, col, cutoff, res)
confusion, accuracy = logreg_test_metrics_fn(y_test_pred_final)
vif = logreg_test_VIF_score_fn(X_test, col)
print('Model Summary:') # Model Summary:
res.summary()
print('\nVIF Score:') # VIF Score:
vif
# print('\nY_Predicted Values:') # Y_Predicted Values:
# y_test_pred
# print('\nY_Predicted Cutoff:') # Y_Predicted Cutoff:
# y_test_pred_final
Model Summary:
| Dep. Variable: | converted | No. Observations: | 7282 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 7268 |
| Model Family: | Binomial | Df Model: | 13 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -1677.1 |
| Date: | Sun, 20 Oct 2024 | Deviance: | 3354.3 |
| Time: | 23:04:35 | Pearson chi2: | 7.86e+03 |
| No. Iterations: | 8 | Pseudo R-squ. (CS): | 0.5805 |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -2.5895 | 0.079 | -32.593 | 0.000 | -2.745 | -2.434 |
| ttime_on_site | 3.3369 | 0.185 | 18.080 | 0.000 | 2.975 | 3.699 |
| last_activity_SMS Sent | 1.9243 | 0.096 | 20.103 | 0.000 | 1.737 | 2.112 |
| lead_source_Welingak Website | 5.9866 | 1.015 | 5.897 | 0.000 | 3.997 | 7.976 |
| tags_Already a student | -3.4670 | 0.714 | -4.856 | 0.000 | -4.866 | -2.068 |
| tags_Closed by Horizzon | 6.7048 | 0.715 | 9.379 | 0.000 | 5.304 | 8.106 |
| tags_Interested in full time MBA | -2.3393 | 0.750 | -3.120 | 0.002 | -3.809 | -0.870 |
| tags_Interested in other courses | -2.0230 | 0.310 | -6.519 | 0.000 | -2.631 | -1.415 |
| tags_Lost to EINS | 5.0457 | 0.518 | 9.733 | 0.000 | 4.030 | 6.062 |
| tags_Not doing further education | -3.1410 | 1.011 | -3.107 | 0.002 | -5.122 | -1.160 |
| tags_Ringing | -2.8475 | 0.204 | -13.985 | 0.000 | -3.247 | -2.448 |
| tags_Will revert after reading the email | 4.5883 | 0.158 | 29.075 | 0.000 | 4.279 | 4.898 |
| tags_invalid number | -3.3843 | 1.070 | -3.162 | 0.002 | -5.482 | -1.286 |
| tags_switched off | -3.2113 | 0.518 | -6.196 | 0.000 | -4.227 | -2.196 |
VIF Score:
| Features | VIF | |
|---|---|---|
| 0 | ttime_on_site | 1.730 |
| 10 | tags_Will revert after reading the email | 1.590 |
| 1 | last_activity_SMS Sent | 1.550 |
| 9 | tags_Ringing | 1.110 |
| 2 | lead_source_Welingak Website | 1.040 |
| 7 | tags_Lost to EINS | 1.040 |
| 4 | tags_Closed by Horizzon | 1.030 |
| 12 | tags_switched off | 1.030 |
| 3 | tags_Already a student | 1.020 |
| 6 | tags_Interested in other courses | 1.020 |
| 5 | tags_Interested in full time MBA | 1.010 |
| 8 | tags_Not doing further education | 1.010 |
| 11 | tags_invalid number | 1.010 |
print('\nConfusion Matrix:') # Confusion Matrix:
confusion
print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Confusion Matrix:
array([[1045, 100],
[ 76, 600]])
Accuracy Score: 0.9033498077979132
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# Calculate the sensitivity
TP/(TP+FN)
# Calculate the specificity
TN/(TN+FP)
# Note: precision and recall below are still computed on the training predictions at the 0.5 cutoff
precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
0.8875739644970414
0.9126637554585153
0.9301097114993905
0.8245677233429395
Approach - 03 - Accuracy Score:¶
print(f'\nAccuracy Score: {accuracy}\n') # Accuracy Score:
Accuracy Score: 0.9033498077979132
------------------------------------------------------¶
Cross Validation Testing¶
logreg = LogisticRegression()
res = logreg.fit(X_train[col], y_train)
# 5-fold cross-validated accuracy on the training features
scores = cross_val_score(res, X_train[col], y_train, cv=5, scoring='accuracy')
np.sqrt(np.abs(scores))  # note: these are square roots of the fold accuracies
# 5-fold cross-validated accuracy on the test features
scores = cross_val_score(res, X_test[col], y_test, cv=5, scoring='accuracy')
np.sqrt(np.abs(scores))
array([0.95398597, 0.96258061, 0.94563776, 0.95790522, 0.94962399])
array([0.97080675, 0.96076892, 0.96076892, 0.96504854, 0.95070824])
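Since the arrays above are square roots of the fold accuracies, the underlying accuracies are their squares, roughly 0.89 to 0.94. A more conventional report is the mean fold accuracy with its spread; a minimal sketch on synthetic data (synthetic, so the numbers differ from the notebook's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the modeling features
X, y = make_classification(n_samples=1000, n_features=13, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring='accuracy')
print(f'CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```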